ToModAPI: Topic Modeling API

This API is built to dynamically perform training, inference, and evaluation for different topic modeling techniques. It provides a common interface and a common set of commands for accessing the different models, which makes them easier to compare.

A demo is available at http://hyperted.eurecom.fr/topic.

Models

In this repository, we provide:

- Code to perform training, inference, and evaluation for 9 topic modeling packages:
  - LDA from the Mallet package
  - LFTM - paper
  - Doc2Topic
  - GSDMM - paper
  - Non-negative Matrix Factorization (NMF)
  - Hierarchical Dirichlet Process (HDP)
  - Latent Semantic Indexing (LSI)
  - Paragraph Vector Topic Model (PVTM) - paper
  - Context Topic Model (CTM) - paper
- A set of pre-trained models, downloadable from here. NOTE: newly trained models are stored by default in .\models, replacing the old ones, unless a new model path is given.
- Data files containing pre-processed corpora:
  - 20ng.txt and 20ng_labels.txt, with 11314 news items from the 20 Newsgroups dataset
  - ted.txt, with 51898 subtitles of TED Talks
  - test.txt and test_labels.txt, a sample of 30 documents extracted from 20ng.txt, used for testing

Each model exposes the following functions:

Training the model

m.train(data, num_topics, preprocessing) # => 'success'

Print the list of computed topics

for i, x in enumerate(m.topics):
    print(f'Topic {i}')
    for word, weight in zip(x['words'], x['weights']):
        print(f'- {word} => {weight}')

Access the information about a specific topic

x = m.topic(0)
words = x['words']
weights = x['weights']
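To illustrate how the 'words' and 'weights' lists of a topic fit together, here is a minimal sketch. It does not call the library: the `topic` dictionary is a made-up stand-in with the same two keys, and `top_words` and `format_topic` are hypothetical helpers, not part of tomodapi.

```python
# Made-up stand-in for the value returned by m.topic(i); only the two keys
# 'words' and 'weights' come from the README, the content is invented.
topic = {
    "words": ["space", "nasa", "launch", "orbit", "moon"],
    "weights": [0.05, 0.04, 0.03, 0.02, 0.01],
}


def top_words(topic, k=3):
    """Return the k highest-weighted (word, weight) pairs of a topic record."""
    pairs = zip(topic["words"], topic["weights"])
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


def format_topic(topic, k=3):
    """Render the top k words of a topic as a single readable line."""
    return ", ".join(f"{word} ({weight:.2f})" for word, weight in top_words(topic, k))


print(format_topic(topic))
```

Sorting explicitly by weight avoids relying on the order in which a model returns its words.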

Access the predictions computed on the training corpus

for i, p in enumerate(m.get_corpus_predictions(topn=3)):  # predictions for each document
    print(f'Predictions on document {i}')
    for topic, confidence in p:
        print(f'- Topic {topic} with confidence {confidence}')
        # - Topic 21 with confidence 0.03927058187976461
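A common follow-up is to see how the documents spread over the topics. The sketch below assumes the shape suggested by the loop above, that is, one list of (topic, confidence) pairs per document. The data is made up and `dominant_topic_counts` is a hypothetical helper, not a library function.

```python
from collections import Counter

# Made-up predictions for four documents: one list of
# (topic, confidence) pairs per document.
corpus_predictions = [
    [(2, 0.61), (0, 0.22), (1, 0.09)],
    [(0, 0.48), (2, 0.31), (1, 0.12)],
    [(2, 0.55), (1, 0.25), (0, 0.11)],
    [(1, 0.70), (0, 0.15), (2, 0.08)],
]


def dominant_topic_counts(predictions):
    """Count how many documents have each topic as their most confident one."""
    counts = Counter()
    for doc_predictions in predictions:
        if not doc_predictions:
            continue  # skip documents without any prediction
        best_topic, _ = max(doc_predictions, key=lambda pair: pair[1])
        counts[best_topic] += 1
    return counts


for topic, n_docs in dominant_topic_counts(corpus_predictions).most_common():
    print(f"Topic {topic}: {n_docs} document(s)")
```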

Predict the topic of a new text

pred = m.predict(text, topn=3)
for topic, confidence in pred:
    print(f'- Topic {topic} with confidence {confidence}')
    # - Topic 21 with confidence 0.03927058187976461
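Low-confidence predictions are often not worth reporting. As a sketch, assuming the prediction is a list of (topic, confidence) pairs as in the loop above, they could be filtered with a threshold. The sample values, the `confident_topics` helper, and the 0.1 default are illustrative assumptions, not part of the library.

```python
# Made-up prediction in the (topic, confidence) shape used above.
pred = [(21, 0.42), (3, 0.17), (8, 0.04)]


def confident_topics(predictions, min_confidence=0.1):
    """Keep only the predictions whose confidence reaches the threshold."""
    return [(topic, conf) for topic, conf in predictions if conf >= min_confidence]


for topic, confidence in confident_topics(pred):
    print(f"- Topic {topic} with confidence {confidence}")
```

A suitable threshold depends on the model and the number of topics, since confidence values from different models are not necessarily on the same scale.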

Computing the coherence against a corpus

# coherence: Type of coherence to compute, among <c_v, c_npmi, c_uci, u_mass>.
# See https://radimrehurek.com/gensim/models/coherencemodel.html#gensim.models.coherencemodel.CoherenceModel
pred = m.coherence(mycorpus, metric='c_v')
print(pred)
# {
#   "c_v": 0.5186710138972105,
#   "c_v_std": 0.1810477961008996,
#   "c_v_per_topic": [
#     0.5845048872767505,
#     0.30693460230781777,
#     0.2611738203246824,
#     ...
#   ]
# }
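The per-topic scores can help to spot weak topics. The sketch below works on a made-up dictionary that reuses only the "c_v_per_topic" key from the output shown above. The values are invented, and `rank_topics` and `below_average_topics` are hypothetical helpers, not part of the library.

```python
from statistics import mean

# Invented per-topic scores for three topics; only the key name
# "c_v_per_topic" is taken from the README output.
result = {"c_v_per_topic": [0.58, 0.31, 0.26]}


def rank_topics(result, key="c_v_per_topic"):
    """Return (topic_index, score) pairs, most coherent topic first."""
    return sorted(enumerate(result[key]), key=lambda item: item[1], reverse=True)


def below_average_topics(result, key="c_v_per_topic"):
    """Return the indices of the topics scoring below the mean coherence."""
    scores = result[key]
    threshold = mean(scores)
    return [index for index, score in enumerate(scores) if score < threshold]


for index, score in rank_topics(result):
    print(f"Topic {index}: {score:.2f}")
print("Below average:", below_average_topics(result))
```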

Evaluating against a ground truth

# metric: Metric for computing the evaluation, among <purity, homogeneity, completeness, v-measure, nmi>.
res = m.get_corpus_predictions(topn=1)
v = m.evaluate(res, ground_truth_labels, metric='purity')
# 0.7825333630516738
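For readers unfamiliar with purity, here is a minimal sketch of the standard definition: each predicted topic is credited with its most frequent ground-truth label, and the credited documents are divided by the total number of documents. This is an illustration on made-up data, not necessarily how the library computes it, and the `purity` function below is a hypothetical helper.

```python
from collections import Counter, defaultdict


def purity(predicted_topics, true_labels):
    """Fraction of documents whose label is the majority label of their topic."""
    if len(predicted_topics) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not predicted_topics:
        raise ValueError("at least one document is required")

    # For every predicted topic, count the ground-truth labels it contains.
    label_counts = defaultdict(Counter)
    for topic, label in zip(predicted_topics, true_labels):
        label_counts[topic][label] += 1

    # Each topic contributes the size of its largest label group.
    majority_total = sum(max(counts.values()) for counts in label_counts.values())
    return majority_total / len(true_labels)


# Made-up example: six documents, three predicted topics, three labels.
topics = [0, 0, 0, 1, 1, 2]
labels = ["a", "a", "b", "b", "b", "c"]
print(purity(topics, labels))  # 5 of the 6 documents match their topic's majority label
```

Purity tends to increase with the number of topics, which is one reason to look at it alongside the other listed metrics.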

The possible parameters can differ depending on the model.

Use in a Python environment

Install this package

pip install tomodapi

Use it in a Python script

from tomodapi import LdaModel

# init the model
m = LdaModel(model_path=path_location)
# train on a corpus
m.train(my_corpus, preprocessing=False, num_topics=10)
# infer topic of a sentence
best_topics = m.predict("In the time since the industrial revolution the climate has increasingly been affected by human activities that are causing global warming and climate change")
topic, confidence = best_topics[0]
# get top words for a given topic
print(m.topic(topic))

If the model_path is not specified, the library will load/save the model from/under models/<model_name>.

Web API

A web API is provided for accessing the library as a service.

Install dependencies

You should install 2 dependencies:

mallet 2.0.8 to be placed in tomodapi\
glove.6B.50d.txt to be placed in tomodapi\glove

Under UNIX, you can use the download_dep.sh script.

sh download_dep.sh

Start the server

python server.py

Docker

Alternatively, you can run a docker container with

docker-compose -f docker-compose.yml up

The container uses mounted volumes so that you can easily update and access the computed models and the data files.

Manual Docker installation

docker build -t hyperted/topic .
docker run -p 27020:5000 --env APP_BASE_PATH=http://hyperted.eurecom.fr/topic/api -d -v /home/semantic/hyperted/tomodapi/models:/models -v /home/semantic/hyperted/tomodapi/data:/data --name hyperted_topic hyperted/topic

# Uninstall
docker stop hyperted_topic
docker rm hyperted_topic
docker rmi hyperted/topic

Publications

If you find this library or API useful in your research, please consider citing our papers:

Pasquale Lisena, Ismail Harrando, Oussama Kandakji and Raphaël Troncy. ToModAPI: A Topic Modeling API to Train, Use and Compare Topic Models. In 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS), November 19, 2020. - paper - BIB
Ismail Harrando, Pasquale Lisena and Raphaël Troncy. Apples to Apples: A Systematic Evaluation of Topic Models. In Recent Advances in Natural Language Processing (RANLP), September 2021. - BIB - appendix
