Mining neighbors, paths, and path patterns from a knowledge graph and a set of seed nodes
A detailed description of the motivation and the algorithms of kgpm is available in the related article.
When citing kgpm, please use the following reference:
Pierre Monnin, Emmanuel Bresso, Miguel Couceiro, Malika Smaïl-Tabbone, Amedeo Napoli, and Adrien Coulet. "Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study". In: 1st international conference "Algebras, graphs and ordered sets" (ALGOS 2020). Ed. by Miguel Couceiro, Pierre Monnin, and Amedeo Napoli. Nancy, France, Aug. 2020. url: https://arxiv.org/pdf/2007.08821.pdf.
@inproceedings{Monnin2020kgpm,
  author = {Monnin, Pierre and Bresso, Emmanuel and Couceiro, Miguel and Sma{\"i}l-Tabbone, Malika and Napoli, Amedeo and Coulet, Adrien},
  title = {{Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study}},
  editor = {Miguel Couceiro and Pierre Monnin and Amedeo Napoli},
  booktitle = {{1st international conference ``Algebras, graphs and ordered sets'' (ALGOS 2020)}},
  address = {Nancy, France},
  year = {2020},
  month = Aug,
  url = {https://arxiv.org/pdf/2007.08821.pdf},
}

Python script to query a knowledge graph and perform its canonicalization. The script outputs:
- Files representing the canonical knowledge graph (in rdf_to_canonical_index, canonical_to_rdf_index, canonical_graph_adjacency, canonical_graph_inv_adjacency, rdf_nodes_cache_manager.csv, predicates_cache_manager.csv)
- Statistics about the knowledge graph before and after canonicalization (in graphs_statistics.md)
Parameters:
- --configuration: path of the JSON configuration file
- --max-rows: maximum number of rows the SPARQL endpoint can return
- --output: base directory for output files
- --self-signed-ssl: enable self-signed SSL certificates
- --debug: print debug statements

Python script to mine neighbors, paths, and path patterns from a canonical knowledge graph and a set of seed nodes.
Parameters:
- --configuration: path of the JSON configuration file
- --graph: base directory for the input graph files
- --dataset-csv: CSV file with the seed node URIs (column 0) and class labels (column 1)
- --dataset-name: name of the data set
- --output: base directory for output files (statistics, a scipy matrix of nodes x features, a column name file, and a numpy vector of class labels)
- -d: maximum degree to allow expansion (disabled with d = -1)
- --lmin: minimum support for features
- --lmax: maximum support for features
- --kmin: minimum k to test (i.e., number of traversed edges, size of paths and path patterns)
- --kmax: maximum k to test
- --tmin: minimum t to test (i.e., level for generalization in class hierarchies); t = -1 disables type generalization, t = 0 only allows generalization with owl:Thing
- --tmax: maximum t to test
- --undirected: whether only out arcs (false) or all arcs (true) are traversed
- --meaningful: additional biomedical filtering strategies:
  - p: only select features containing a pathway
  - g: only select features containing a gene or a GO class
  - m: only select features containing a MeSH class
  - pg: disjunction of p and g
  - pgm: disjunction of p, g, and m
  - all: test all previous filters (thus, 5 outputs)
  - no_check: disable the additional filtering
- --debug: print debug statements

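
To make the roles of -d, k, --undirected, --lmin, and --lmax more concrete, here is a minimal sketch on a toy graph. It is not kgpm's implementation: the toy adjacency dictionaries, the helper names neighbors_within_k and filter_by_support, and the reading of -d as "do not expand a node whose degree exceeds d" are assumptions made for illustration. Only neighbor features are shown; paths and path patterns are left out.

```python
from collections import defaultdict


def neighbors_within_k(adj, inv_adj, seed, k, d=-1, undirected=False):
    """Return the nodes reachable from `seed` by traversing at most k edges."""

    def next_nodes(node):
        out = set(adj.get(node, ()))
        if undirected:
            # Also follow incoming arcs when the graph is treated as undirected
            out |= set(inv_adj.get(node, ()))
        return out

    visited = {seed}
    frontier = {seed}
    reached = set()
    for _ in range(k):
        new_frontier = set()
        for node in frontier:
            candidates = next_nodes(node)
            # Assumed reading of -d: nodes with more than d arcs are not expanded
            if d != -1 and len(candidates) > d:
                continue
            new_frontier |= candidates - visited
        visited |= new_frontier
        reached |= new_frontier
        frontier = new_frontier
    return reached


def filter_by_support(features_per_seed, lmin, lmax):
    """Keep features associated with at least lmin and at most lmax seed nodes."""
    support = defaultdict(int)
    for features in features_per_seed.values():
        for feature in features:
            support[feature] += 1
    return {f for f, s in support.items() if lmin <= s <= lmax}


# Hypothetical toy graph: adjacency and inverse adjacency as plain dicts
adj = {"s1": ["a", "b"], "s2": ["b"], "a": ["c"], "b": ["c"]}
inv_adj = {"a": ["s1"], "b": ["s1", "s2"], "c": ["a", "b"]}

features = {s: neighbors_within_k(adj, inv_adj, s, k=2) for s in ("s1", "s2")}
kept = filter_by_support(features, lmin=2, lmax=2)
```

In this toy example, only the neighbors shared by both seed nodes pass the support filter. Setting undirected=True additionally lets the expansion walk back along incoming arcs.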
Python script to compute the statistics about the subgraph accessible from a set of seed nodes in a canonical knowledge graph. It outputs a markdown file containing the number of neighbors and types reachable from the seed nodes.
Parameters:
- --configuration: path of the JSON configuration file
- --graph: base directory for the input graph files
- --dataset-csv: CSV file with the seed node URIs (column 0) and class labels (column 1)
- --dataset-name: name of the data set
- --output: base directory for output files (Markdown files)
- -d: maximum degree to allow expansion (disabled with d = -1)
- --undirected: whether only out arcs (false) or all arcs (true) are traversed
- --detailed: enable detailed statistics, i.e., number of neighbors and types accessible w.r.t. k and t until the full neighborhood is reached. By default, only the maximum numbers of reachable neighbors and types in the full neighborhood are output (k and t are not given).

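
The difference between the default and the --detailed statistics can be sketched as follows for neighbors (types and the t parameter are left out). This is an illustration under assumptions, not the script's actual code: the function name reachable_counts_per_k and the toy graph are hypothetical, only out arcs are followed, and seed nodes themselves are not counted as neighbors.

```python
def reachable_counts_per_k(adj, seeds):
    """Cumulative number of distinct neighbors reached for k = 1, 2, ...

    Expansion stops once an additional hop brings no new node, i.e., once
    the full neighborhood of the seed nodes has been reached.
    """
    seed_set = set(seeds)
    visited = set(seed_set)
    frontier = set(seed_set)
    counts = []
    while True:
        # Nodes one more edge away that have not been seen yet
        frontier = {n for node in frontier for n in adj.get(node, ())} - visited
        if not frontier:
            break
        visited |= frontier
        counts.append(len(visited) - len(seed_set))
    return counts


# Hypothetical toy graph (out arcs only)
adj = {"s1": ["a"], "a": ["b"], "b": ["c"], "s2": ["b"]}

counts = reachable_counts_per_k(adj, ["s1", "s2"])

# Detailed view: one value per k; default view: only the final maximum
detailed_rows = [f"| {k} | {n} |" for k, n in enumerate(counts, start=1)]
max_neighbors = counts[-1] if counts else 0
```

The detailed view corresponds to the whole list of counts, one per value of k, while the default view only keeps the last value.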
An example of a JSON configuration file is given in configuration.json.example. Keys are:
- server-address: address of the SPARQL endpoint to query
- url-json-conf-attribute: URL attribute to use to get JSON results
- url-json-conf-value: value of the url-json-conf-attribute to get JSON results
- url-default-graph-attribute: URL attribute to use to define the default graph
- url-default-graph-value: value of url-default-graph-attribute to define the default graph
- url-query-attribute: URL attribute to use to define the query
- timeout: timeout value for HTTP requests
- username: username to use if HTTP authentication is required (empty otherwise)
- password: password to use if HTTP authentication is required (empty otherwise)
- path_predicates_blacklist: blacklist of URIs or prefixes of predicates not to traverse
- types_blacklist: blacklist of URIs or prefixes of types not to use in path generalization
- types_expansion_blacklist: blacklist of URIs or prefixes of types whose instances cannot be traversed
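
As an illustration of how these keys might fit together, the sketch below loads a configuration and assembles a SPARQL request URL from it. All values are placeholders and may differ from those in configuration.json.example; the unit of timeout is not stated here, so 600 is arbitrary. The helpers build_query_url and is_blacklisted are hypothetical names, and the exact-or-prefix matching simply follows the "URIs or prefixes" wording above rather than the actual code.

```python
import json
from urllib.parse import urlencode

# Placeholder configuration; every value below is an assumption
config = json.loads("""
{
  "server-address": "http://localhost:8890/sparql",
  "url-json-conf-attribute": "format",
  "url-json-conf-value": "application/sparql-results+json",
  "url-default-graph-attribute": "default-graph-uri",
  "url-default-graph-value": "http://example.org/graph",
  "url-query-attribute": "query",
  "timeout": 600,
  "username": "",
  "password": "",
  "path_predicates_blacklist": ["http://www.w3.org/2002/07/owl#"],
  "types_blacklist": [],
  "types_expansion_blacklist": []
}
""")


def build_query_url(conf, sparql_query):
    """Combine the URL attributes and values of the configuration into a GET URL."""
    params = {
        conf["url-json-conf-attribute"]: conf["url-json-conf-value"],
        conf["url-default-graph-attribute"]: conf["url-default-graph-value"],
        conf["url-query-attribute"]: sparql_query,
    }
    return conf["server-address"] + "?" + urlencode(params)


def is_blacklisted(uri, blacklist):
    """True if `uri` equals a blacklist entry or starts with one (prefix match)."""
    return any(uri.startswith(entry) for entry in blacklist)


url = build_query_url(config, "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
skip = is_blacklisted(
    "http://www.w3.org/2002/07/owl#sameAs", config["path_predicates_blacklist"]
)
```

With this placeholder blacklist, any predicate in the OWL namespace would be skipped during traversal, while predicates from other namespaces would still be followed.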

Dependencies:
- tqdm
- numpy
- bitarray
- scipy