Using Knowledge Graphs in Data Science - From Symbolic to Latent Representations (and a Few Steps Back)

8/11/21 Heiko Paulheim 1
Using Knowledge Graphs in Data Science – From Symbolic to Latent Representations and a few Steps Back
Heiko Paulheim, University of Mannheim

2 Brief Introduction
[Timeline figure: 2006, 2008, 2011, 2013, 2014, 2017; Pre-PhD years, PhD years, PostDoc years, Assistant Prof., Full Prof.; labels: SDType, rdf2vec, ReNewRS, Kare§KoKI, MELT]

3 Knowledge Graphs: At a Glance
• Graph-shaped knowledge representation
  – nodes: entities
  – edges: relations
[Figure: example graph with the nodes Heiko Paulheim, DWS Group, University of Mannheim, Mannheim, Baden-Württemberg, and Germany, connected by the edges employer, affiliation, part of, residence, and state]

4 Knowledge Graphs in Organizations
• Knowledge Graphs are used…
• …in companies and organizations
  – collect, organize, and integrate knowledge
  – link isolated information sources
  – make information searchable and findable
Masuch et al., 2016

5 Public Knowledge Graphs
• Knowledge Graphs are used…
• …as (free), public resources
  – collect common knowledge
  – general purpose, not task specific
  – make it easy to build knowledge-intensive applications

6 Usage of Public Knowledge Graphs
Question: “OK, Google, when will the final season of Money Heist be on Netflix?”
Answer: “The fifth season of Money Heist will be released on September 3rd.”

7 Usage of Public Knowledge Graphs
Question: “OK, Google, when will the final season of Money Heist be on Netflix?”
[Figure: knowledge graph excerpt with “has part” and “release date” edges; release dates shown: 2020-04-03 and 2021-09-03]

8 Usage of Public Knowledge Graphs
Question: “Are there any other series by the same creator?”
[Figure: knowledge graph excerpt with “has part”, “release date”, “creator”, and “cast” edges; release dates shown: 2020-04-03 and 2021-09-03]

9 History: Cyc
• The beginning
  – Encyclopedic collection of knowledge
  – Started by Douglas Lenat in 1984
  – Estimate: 350 person-years and 250,000 rules should do the job of collecting the essence of the world’s knowledge
• The present (as of June 2017)
  – ~1,000 person-years, $120M total development cost
  – 21M axioms and rules
  – Declared “ready to use” in 2017

10 History: Freebase
• The 2000s
  – Freebase: collaborative editing
  – Schema not fixed
• Present
  – Acquired by Google in 2010
  – Powered first version of Google’s Knowledge Graph
  – Shut down in 2016
  – Partly lives on in Wikidata (see in a minute)

11 History: Wikidata
• The 2010s
  – Wikidata: launched 2012
  – Goal: centralize data from the Wikipedia language editions
  – Collaborative
  – Imports other datasets
• Present
  – One of the largest public knowledge graphs
  – Includes rich provenance

12 History: DBpedia & co.
• The late 2000s
  – DBpedia: launched 2007
  – YAGO: launched 2008
  – Extraction from Wikipedia using mappings & heuristics
• Present
  – Two of the most used knowledge graphs
  – ...with Wikidata catching up

13 History: NELL
• The 2010s
  – NELL: Never-Ending Language Learner
  – Input: ontology, seed examples, text corpus
  – Output: facts, text patterns
  – Large degree of automation, occasional human feedback
• Until 2018
  – Continuously ran for ~8 years
  – New release every few days
http://rtw.ml.cmu.edu/rtw/overview

14 Knowledge Graph Creation
• Sources for generating knowledge graphs:
  – Manual (also: crowd sourcing) curation
    • Cyc, Freebase, Wikidata, ...
  – (Semi-)structured knowledge (Wikis, databases, …)
    • DBpedia, YAGO, BabelNet, ...
  – Unstructured text or web page collections
    • NELL, DeepDive, ReVerb, …

15 Knowledge Graph Creation – Ongoing Projects
• WebIsA & WebIsALOD
  – 400M hypernyms extracted from a Web crawl
Seitner et al. (2016): A Large DataBase of Hypernymy Relations Extracted from the Web

16 Knowledge Graph Creation – Ongoing Projects
• DBkWik
  – Harvesting data from 400k Wikis
Paulheim & Hertling (2018): DBkWik: A consolidated knowledge graph from thousands of Wikis

17 Knowledge Graph Creation – Ongoing Projects
• CaLiGraph
  – Learning analogies, e.g., from lists
Heist (2018): Towards Knowledge Graph Construction from Entity Co-occurrence

18 Use Cases for Knowledge Graphs
• Background Knowledge
  – e.g., company data (address, CEO, branch, …) → SAP CRM (BSc thesis 2019)
  – e.g., geographic regions (demographics) → for example, sales data prediction
  – data interpretation (e.g., Excel tables, business models) → PhD thesis under supervision
• Data Integration
  – unified view of different data sources
  – relating business entities in different systems
  – cross-source data visualization and analytics

19 Knowledge Graphs in Data Science
• Typical cases:
  – predictive modeling, information retrieval, recommendation, …
• For all of those, there are sophisticated implementations
  – but...?

20 Wanted: A Bridge between Both Worlds

21 Wanted: A Bridge between Both Worlds
• Data Science tools for prediction etc.
  – Python, Weka, R, RapidMiner, …
  – Algorithms that work on vectors, not graphs
• Bridges built over the past years:
  – FeGeLOD (Weka, 2012), RapidMiner LOD Extension (2015), Python KG Extension (2021)

22 Wanted: A Bridge between Both Worlds
• Transformation strategies (aka propositionalization)
  – e.g., types: type_horror_movie=true
  – e.g., data values: year=2011
  – e.g., aggregates: nominations=7
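
The three transformation strategies can be sketched as follows. This is a minimal illustration over a toy list of (subject, predicate, object) triples; the entity names, the `propositionalize` helper, and the feature naming scheme are assumptions made for this example and are not the API of FeGeLOD or of any of the extensions named earlier.

```python
from collections import defaultdict

# Toy triples (subject, predicate, object); all names are made up for illustration.
TRIPLES = [
    ("movie1", "rdf:type", "HorrorMovie"),
    ("movie1", "year", 2011),
    ("movie1", "nomination", "award_a"),
    ("movie1", "nomination", "award_b"),
    ("movie2", "rdf:type", "Comedy"),
    ("movie2", "year", 1999),
]


def propositionalize(triples, entity):
    """Turn the outgoing triples of one entity into a flat feature dict."""
    features = {}
    counts = defaultdict(int)
    for s, p, o in triples:
        if s != entity:
            continue
        if p == "rdf:type":
            # Strategy 1: one boolean feature per type.
            features[f"type_{o}"] = True
        elif isinstance(o, (int, float)):
            # Strategy 2: literal data values are copied as-is.
            features[p] = o
        else:
            # Strategy 3: relations to other entities are aggregated by counting.
            counts[p] += 1
    for p, n in counts.items():
        features[f"count_{p}"] = n
    return features


print(propositionalize(TRIPLES, "movie1"))
```

The resulting dictionaries can be aligned into a table (one column per feature) and passed to any learner that works on vectors.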

23 Wanted: A Bridge between Both Worlds
• Observations with simple propositionalization strategies
  – Even simple features (e.g., add all numbers and types) can help on many problems
  – More sophisticated features often bring additional improvements
    • Combinations of relations and individuals
      – e.g., movies directed by Steven Spielberg
    • Combinations of relations and types
      – e.g., movies directed by Oscar-winning directors
    • …
  – But
    • The search space is enormous!
    • “Generate first, filter later” does not scale well

24 Wanted: A Bridge between Both Worlds
• Excursion: word embeddings
  – word2vec proposed by Mikolov et al. (2013)
  – predict a word from its context or vice versa
    • Idea: similar words appear in similar contexts, like
      – Jobs, Wozniak, and Wayne founded Apple Computer Company in April 1976
      – Google was officially founded as a company in September 1998
  – usually trained on large text corpora
    • projection layer: embedding vectors

25 From Word Embeddings to Graph Embeddings
• Basic idea:
  – extract random walks from an RDF graph: Mulholland Dr. → director → David Lynch → nationality → US
  – feed walks into word2vec algorithm
• Order of magnitude (e.g., DBpedia)
  – ~6M entities (“words”)
  – start up to 500 random walks per entity, length up to 8 → corpus of >20B tokens
• Result:
  – node embeddings
  – most often outperform other propositionalization techniques
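
A minimal sketch of the walk extraction step, assuming a tiny adjacency-list graph built from the example walk. The `random_walks` helper and its parameters are illustrative assumptions; the slide does not state whether "length up to 8" counts hops or tokens, and this sketch counts hops. The resulting token sequences would then be handed to a word2vec implementation as "sentences", which is not shown here.

```python
import random

# Toy graph: node -> list of (predicate, target node). Names follow the example walk.
GRAPH = {
    "Mulholland_Dr.": [("director", "David_Lynch")],
    "David_Lynch": [("nationality", "US")],
    "US": [],
}


def random_walks(graph, start, num_walks, max_hops, rng):
    """Generate up to num_walks random walks of at most max_hops edges from start."""
    walks = []
    for _ in range(num_walks):
        walk = [start]
        node = start
        for _ in range(max_hops):
            edges = graph.get(node, [])
            if not edges:
                break  # dead end: the walk stops early
            predicate, node = rng.choice(edges)
            walk += [predicate, node]
        walks.append(walk)
    return walks


rng = random.Random(42)
# One "sentence" per walk; entities and predicates are the "words".
corpus = [w for entity in GRAPH for w in random_walks(GRAPH, entity, 3, 8, rng)]
for sentence in corpus[:3]:
    print(" -> ".join(sentence))
```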

26 A First Glance at RDF2vec Embeddings
• Observation: close projection of similar entities

27 Random vs. non-random
• Maybe random walks are not such a good idea
  – They may give too much weight to less-known entities and facts
    • Strategies:
      – Prefer edges with more frequent predicates
      – Prefer nodes with higher indegree
      – Prefer nodes with higher PageRank
      – …
  – They may cover less-known entities and facts too little
    • Strategies:
      – The opposite of all of the above strategies
• External signals (e.g., human notions of importance)
  – generally work better than graph-internal signals
Cochez et al. (2017): Biased Graph Walks for RDF Graph Embeddings
Al Taweel and Paulheim (2020): Towards Exploiting Implicit Human Feedback for Improving RDF2vec Embeddings
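
One way to picture a biased walk is to replace the uniform choice of the next edge with a weighted choice. The sketch below uses node indegree as the weight and its inverse for the "opposite" strategy. The toy edges, the `biased_step` helper, and the two weight functions are assumptions for illustration and do not reproduce the exact weighting schemes of the cited papers.

```python
import random
from collections import Counter

# Toy edge list (subject, predicate, object); names are made up.
EDGES = [
    ("A", "p", "Hub"),
    ("B", "p", "Hub"),
    ("C", "p", "Hub"),
    ("A", "q", "Rare"),
]
INDEGREE = Counter(o for _, _, o in EDGES)  # Hub: 3, Rare: 1


def biased_step(edges, node, weight_fn, rng):
    """Pick one outgoing (predicate, target) pair, weighted by weight_fn(target)."""
    candidates = [(p, o) for s, p, o in edges if s == node]
    weights = [weight_fn(o) for _, o in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def prefer_popular(node):
    return INDEGREE[node]  # higher indegree, higher chance


def prefer_rare(node):
    return 1.0 / INDEGREE[node]  # the opposite strategy


rng = random.Random(0)
popular = Counter(biased_step(EDGES, "A", prefer_popular, rng)[1] for _ in range(1000))
rare = Counter(biased_step(EDGES, "A", prefer_rare, rng)[1] for _ in range(1000))
print(popular, rare)
```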

28 Local Embeddings
• Recap: order of magnitude (e.g., DBpedia)
  – ~6M entities (“words”)
  – start up to 500 random walks per entity, length up to 8 → corpus of >20B tokens
  – “Train once, reuse often”
• In some cases, only a small subset (of the ~6M entities) is of interest
  – RDF2vec light: “train when needed”
  – Runtime: minutes instead of days
Portisch et al. (2020): RDF2Vec Light – A Lightweight Approach for Knowledge Graph Embeddings

29 RDF2vec: Example Applications
• Data Model Matching with WebIsA and RDF2vec
Portisch et al. (2019): Evaluating ontology matchers on real-world financial services data models.

30 RDF2vec: Example Applications
• Entity disambiguation: linking texts to a knowledge graph
Türker et al. (2019): Knowledge-Based Short Text Categorization Using Entity and Category Embedding

31 RDF2vec: Example Applications
• Finding related research papers on COVID-19
Steenwinckel et al. (2020): Facilitating COVID-19 Meta-analysis Through a Literature Knowledge Graph

32 RDF2vec: Example Applications
• Table search by keyword
Zhang and Balog (2018): Ad Hoc Table Retrieval using Semantic Similarity.

33 RDF2vec: Example Applications
• Predicting biological interactions
Sousa et al. (2021): Supervised Semantic Similarity.

34 RDF2vec: Example Applications
• Zero-Shot Image Classification
Hascoet et al. (2017): Semantic Web and Zero-Shot Learning of Large Scale Visual Classes.

35 Embeddings for Link Prediction
• RDF2vec example
  – similar instances form clusters, direction of relation is ~stable
  – link prediction by analogy reasoning (Japan – Tokyo ≈ China – Beijing)
Ristoski & Paulheim: RDF2vec: RDF Graph Embeddings for Data Mining. ISWC, 2016
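
The analogy argument can be made concrete with vector arithmetic: if Japan – Tokyo ≈ China – Beijing, then Tokyo – Japan + China should land near Beijing. The sketch below uses hand-made 2D vectors, which are purely illustrative assumptions and not real RDF2vec embeddings; `predict_by_analogy` is a hypothetical helper name.

```python
import math

# Hand-made 2D "embeddings"; countries lie on one line, capitals roughly one unit above.
VECTORS = {
    "Japan": (1.0, 0.0), "Tokyo": (1.0, 1.0),
    "China": (3.0, 0.0), "Beijing": (3.1, 0.9),
    "Germany": (5.0, 0.0), "Berlin": (5.0, 1.1),
}


def predict_by_analogy(a, b, c, vectors):
    """Return the entity d (other than a, b, c) closest to b - a + c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = (e for e in vectors if e not in (a, b, c))
    return min(candidates, key=lambda e: math.dist(vectors[e], target))


# "Japan is to Tokyo as China is to ...?"
print(predict_by_analogy("Japan", "Tokyo", "China", VECTORS))
```

The prediction only works as long as the relation direction is roughly the same for all instance pairs, which is the "~stable" observation on the slide.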

36 Embeddings for Link Prediction
• In RDF2vec, relation preservation is a by-product
• TransE (and its descendants): direct modeling
  – Formulates RDF embedding as an optimization problem
  – Find a mapping of entities and relations to $\mathbb{R}^n$ so that
    • across all triples $\langle s, p, o \rangle$, $\sum \lVert \mathbf{s} + \mathbf{p} - \mathbf{o} \rVert$ is minimized
    • try to obtain a smaller error for existing triples than for non-existing ones
Bordes et al.: Translating Embeddings for Modeling Multi-relational Data. NIPS 2013.
Fan et al.: Learning Embedding Representations for Knowledge Inference on Imperfect and Incomplete Repositories. WI 2016
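
The two conditions above can be sketched as a scoring function plus a margin-based ranking loss, the form commonly used for TransE. The vectors below are hand-picked toy values (an assumption for illustration), and no training loop is shown; the sketch only evaluates the error for one existing and one corrupted triple.

```python
import math


def transe_score(s, p, o):
    """Error ||s + p - o|| of a triple; lower means more plausible."""
    return math.sqrt(sum((si + pi - oi) ** 2 for si, pi, oi in zip(s, p, o)))


def margin_loss(pos_score, neg_score, margin=1.0):
    """Zero once the existing triple beats the corrupted one by at least the margin."""
    return max(0.0, margin + pos_score - neg_score)


# Toy vectors, chosen so that japan + capital == tokyo exactly.
japan, capital = (1.0, 0.0), (0.0, 1.0)
tokyo, beijing = (1.0, 1.0), (3.0, 1.0)

pos = transe_score(japan, capital, tokyo)    # existing triple
neg = transe_score(japan, capital, beijing)  # corrupted (non-existing) triple
loss = margin_loss(pos, neg)
print(pos, neg, loss)
```

During training, the entity and relation vectors would be updated by gradient descent on this loss over many sampled pairs of existing and corrupted triples.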

37 Link Prediction vs. Node Embedding
• Hypothesis:
  – Embeddings for link prediction also cluster similar entities
  – Node embeddings can also be used for link prediction
Portisch et al. (under review): Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction - Two Sides of the Same Coin?

38 Similarity vs. Relatedness
• Closest 10 entities to Angela Merkel in different vector spaces
Portisch et al. (under review): Knowledge Graph Embedding for Data Mining vs. Knowledge Graph Embedding for Link Prediction - Two Sides of the Same Coin?

39 Similarity vs. Relatedness
• (s-)RDF2vec allows an explicit trade-off with different walk strategies
[Figure: example knowledge graph with the nodes Adler Mannheim, SAP Arena, Reiss-Engelhorn-Museum, Mannheim, Baden-Württemberg, and Germany, connected by the edges city, stadium, location, federal state, and country]
• Walk generation
  – “Classic” RDF2vec walks
    • Adler_Mannheim → city → Mannheim → country → Germany
    • Adler_Mannheim → stadium → SAP_Arena → location → Mannheim
    • SAP_Arena → location → Mannheim → country → Germany
    • ...
  – s-RDF2vec walks
    • city → Mannheim → country
    • stadium → SAP_Arena → location
    • location → Mannheim → country
    • ...
[Figure: processing pipeline with the labels RDF2vec “union walks”, RDF2vec “classic”, RDF2vec “edge”, concatenated vector, global PCA, test cases, concatenated vector (task-specific subset), weights w1 and w2, and (weighted) local PCA]
Portisch et al. (under review): s-RDF2vec: Injecting Knowledge Graph Structure Into RDF2vec Entity Embeddings.

40 Similarity vs. Relatedness
• s-RDF2vec
  – using different walk strategies
  – combining different vector spaces (weighted combinations are possible)
• 10 closest neighbors to Mannheim:
Portisch et al. (under review): s-RDF2vec: Injecting Knowledge Graph Structure Into RDF2vec Entity Embeddings.
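
A weighted combination of two vector spaces can be pictured as scaling each entity's two vectors before concatenating them, so that the weights control which space dominates the similarity. The sketch below is an assumption-laden illustration: the toy vectors, the helper names, and the use of cosine similarity are made up for this example, and the PCA step of the s-RDF2vec pipeline is omitted.

```python
import math


def weighted_concat(vec_a, vec_b, w1, w2):
    """Scale the two vectors by w1 and w2, then concatenate them."""
    return [w1 * x for x in vec_a] + [w2 * x for x in vec_b]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Two toy vector spaces over the same entities (values are made up).
SPACE_1 = {"A": [1.0, 0.0], "B": [1.0, 0.0], "C": [0.0, 1.0]}
SPACE_2 = {"A": [0.0, 1.0], "B": [1.0, 0.0], "C": [0.0, 1.0]}


def similarity(x, y, w1, w2):
    return cosine(weighted_concat(SPACE_1[x], SPACE_2[x], w1, w2),
                  weighted_concat(SPACE_1[y], SPACE_2[y], w1, w2))


# With all weight on space 1, A is closest to B; with all weight on space 2, to C.
print(similarity("A", "B", 1.0, 0.0), similarity("A", "C", 0.0, 1.0))
```

Intermediate weights move the neighborhood of an entity smoothly between the two extremes.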

41 Similarity vs. Relatedness
• Recap word embeddings:
  – Jobs, Wozniak, and Wayne founded Apple Computer Company in April 1976
  – Google was officially founded as a company in September 1998
• Graph walks:
  – Hamburg → country → Germany → leader → Angela_Merkel
  – Germany → leader → Angela_Merkel → birthPlace → Hamburg
  – Hamburg → leader → Peter_Tschentscher → residence → Hamburg
[Figure: graph with the nodes Germany, Angela_Merkel, Hamburg, and Peter_Tschentscher, connected by the edges country, leader, birthPlace, and residence]

42 Similarity vs. Relatedness
• Surrounding entities indicate relatedness
  – Hamburg → country → Germany → leader → Angela_Merkel
  – Germany → leader → Angela_Merkel → birthPlace → Hamburg
• Same entities in similar positions indicate similarity
  – Germany → leader → Angela_Merkel → birthPlace → Hamburg
  – Hamburg → leader → Peter_Tschentscher → residence → Hamburg
    • Someone is a leader vs. something has a leader
• Solution approach: use embedding approach that respects positions
  – CWINDOW / Structured Skip-ngram
Portisch and Paulheim (2021): Putting RDF2vec in Order.
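
The difference between "is a leader" and "has a leader" becomes visible once the context of a token keeps its position. The sketch below only extracts context sets from the three walks above, with and without offsets; it does not train an embedding, and the `contexts` helper with its `window` and `keep_position` parameters is an assumption for illustration.

```python
WALKS = [
    ["Hamburg", "country", "Germany", "leader", "Angela_Merkel"],
    ["Germany", "leader", "Angela_Merkel", "birthPlace", "Hamburg"],
    ["Hamburg", "leader", "Peter_Tschentscher", "residence", "Hamburg"],
]


def contexts(walks, target, window=1, keep_position=True):
    """Collect the context tokens of target, optionally tagged with their offset."""
    found = set()
    for walk in walks:
        for i, token in enumerate(walk):
            if token != target:
                continue
            for offset in range(-window, window + 1):
                j = i + offset
                if offset != 0 and 0 <= j < len(walk):
                    found.add((offset, walk[j]) if keep_position else walk[j])
    return found


# Order-agnostic view: "leader" is a context of both Germany and Angela_Merkel.
print(contexts(WALKS, "Germany", keep_position=False))
print(contexts(WALKS, "Angela_Merkel", keep_position=False))

# Position-aware view: "leader" follows Germany but precedes Angela_Merkel.
print(contexts(WALKS, "Germany"))
print(contexts(WALKS, "Angela_Merkel"))
```

In the position-aware view, Angela_Merkel and Peter_Tschentscher share the context (-1, "leader"), while Germany and Hamburg share (+1, "leader"), which is the signal an order-aware word2vec variant can exploit.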

43 Similarity vs. Relatedness
• Why bother?
  – Use case: table interpretation (a special case of entity disambiguation)
[Figure labels: related, similar]

44 Back to Interpretability
• Hot topic: Explainable AI
  – Knowledge Graphs are a favorable ingredient
  – Human/machine interpretable knowledge → explainable systems
• However:
  – Embeddings replace interpretable axioms with numeric vectors over non-interpretable dimensions
  – Where did the semantics go?
Paulheim (2018): Make Embeddings Semantic Again!

45 Towards Semantic Vector Space Embeddings
[Figure labels: cartoon, superhero]
Paulheim (2018): Make Embeddings Semantic Again!

46 Towards Semantic Vector Space Embeddings
[Figure labels: cartoon, superhero]
• Approach 1: learn interpretation function
• Each dimension of the embedding model is a target for a separate learning problem
• Learn a function to explain the dimension
• E.g.: $y \approx -\lvert \exists\, \mathit{character}.\mathit{Superhero} \rvert$
• Just an approximation used for explanations and justifications
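
Approach 1 can be pictured as a small regression problem per dimension: the dimension value is the target, and symbolic features from the graph are the inputs. The sketch below fits a one-variable least-squares line, with the number of superhero characters of an entity as the only feature. The data points are made up, and the choice of a linear model is an assumption for illustration, not the method of the cited paper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up data: per entity, the count of characters that are superheroes,
# and the value of one embedding dimension y for that entity.
superhero_counts = [0, 0, 1, 2, 3]
dimension_values = [0.1, -0.1, -1.0, -2.1, -2.9]

slope, intercept = fit_line(superhero_counts, dimension_values)
print(f"y ~ {slope:.2f} * count + {intercept:.2f}")
```

A slope close to -1 with a small intercept would support reading this dimension as "minus the number of superhero characters", which is only an approximate explanation, as noted above.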

47 Towards Semantic Vector Space Embeddings
[Figure labels: cartoon, superhero]
• Approach 2: learn inherently interpretable embeddings
• Step 1: learn typical patterns that exist in a knowledge graph
  – e.g., graph pattern learning
  – e.g., Horn clauses
• Step 2a: use those patterns as embedding dimensions
  – probably not low dimensional
• Step 2b: compact the space
  – e.g., use dimensions for mutually exclusive patterns

48 Towards Semantic Vector Space Embeddings
• Different angle: learn interpretation for similarity function
[Figure labels: ~similar type, ~same country, ~connected to same entity]

49 Summary
• Knowledge Graphs are a versatile ingredient for AI
  – Integrated view on data
  – Large-scale free source of background knowledge
• Knowledge Graph Embeddings
  – Effective processing of large-scale knowledge sources
  – Encoding of similarity and/or relatedness
    • RDF2vec: explicit trade-off is possible!
  – Additional insights that are not explicit in the graph
    • aka latent semantics

50 More on RDF2vec
• Collection of
  – Implementations
  – Pre-trained models
  – >40 use cases in various domains

51 Thank you!
http://www.heikopaulheim.com
@heikopaulheim

