This API is built to dynamically perform training, inference, and evaluation for different topic modeling techniques. The API provides common interfaces and commands for accessing the different models, making it easier to compare them.
A demo is available at http://hyperted.eurecom.fr/topic.
In this repository, we provide:
- Code to perform training, inference, and evaluation for 9 topic modeling packages.
- A set of pre-trained models, downloadable from here. **NOTE: Newly trained models are by default stored in `.\models`, replacing the old ones, unless a new model path is given.**
- Data files containing pre-processed corpora:
  - `20ng.txt` and `20ng_labels.txt`, with 11314 news items from the 20 NewsGroup dataset
  - `ted.txt`, with 51898 subtitles of TED Talks
  - `test.txt` and `test_labels.txt`, an extraction of 30 documents from `20_ng.txt`, used for testing
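As a minimal sketch of how such a pair of files could be read, the helper below assumes one document per line in the text file and one label per line in the labels file. That layout is an assumption, not something stated above, and `load_corpus` and the sample file names are hypothetical and not part of the library.

```python
import os
import tempfile


def load_corpus(text_path, labels_path=None):
    """Read one document per line and, optionally, one label per line.

    The line-per-document layout is an assumption about the data files.
    """
    with open(text_path, encoding="utf-8") as f:
        docs = f.read().splitlines()
    if labels_path is None:
        return docs, None
    with open(labels_path, encoding="utf-8") as f:
        labels = f.read().splitlines()
    # Documents and labels are paired by position, so the counts must match.
    if len(docs) != len(labels):
        raise ValueError(f"{len(docs)} documents but {len(labels)} labels")
    return docs, labels


# Tiny stand-in files so the sketch runs without the real data.
tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, "sample.txt")
labels_path = os.path.join(tmp_dir, "sample_labels.txt")
with open(text_path, "w", encoding="utf-8") as f:
    f.write("the team won the game\nnew graphics card released\n")
with open(labels_path, "w", encoding="utf-8") as f:
    f.write("sport\nhardware\n")

docs, labels = load_corpus(text_path, labels_path)
print(len(docs), labels)
```

Checking the two lengths up front avoids silently pairing documents with the wrong labels during evaluation.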
Each model exposes the following functions:

    # Train the model
    m.train(data, num_topics, preprocessing)  # => 'success'

    # Print the list of computed topics
    for i, x in enumerate(m.topics):
        print(f'Topic {i}')
        for word, weight in zip(x['words'], x['weights']):
            print(f'- {word} => {weight}')

    # Access the information about a specific topic
    x = m.topic(0)
    words = x['words']
    weights = x['weights']

    # Get the predictions for each document
    for i, p in enumerate(m.get_corpus_predictions(topn=3)):
        print(f'Predictions on document {i}')
        for topic, confidence in p:
            print(f'- Topic {topic} with confidence {confidence}')
            # - Topic 21 with confidence 0.03927058187976461

    # Predict the topics of a new text
    pred = m.predict(text, topn=3)
    for topic, confidence in pred:
        print(f'- Topic {topic} with confidence {confidence}')
        # - Topic 21 with confidence 0.03927058187976461

    # Compute the coherence
    # coherence: Type of coherence to compute, among <c_v, c_npmi, c_uci, u_mass>.
    # See https://radimrehurek.com/gensim/models/coherencemodel.html#gensim.models.coherencemodel.CoherenceModel
    pred = m.coherence(mycorpus, metric='c_v')
    print(pred)
    # {
    #   "c_v": 0.5186710138972105,
    #   "c_v_std": 0.1810477961008996,
    #   "c_v_per_topic": [
    #     0.5845048872767505,
    #     0.30693460230781777,
    #     0.2611738203246824,
    #     ...
    #   ]
    # }

    # Evaluate against a ground truth
    # metric: Metric for computing the evaluation, among <purity, homogeneity, completeness, v-measure, nmi>.
    res = m.get_corpus_predictions(topn=1)
    v = m.evaluate(res, ground_truth_labels, metric='purity')
    # 0.7825333630516738

The possible parameters can differ depending on the model.
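To make the `purity` metric more concrete, here is a minimal sketch of the standard definition: each predicted topic is credited with its most frequent ground-truth label, and the credited documents are divided by the total. This is an independent illustration, not the library's own implementation. It assumes predictions shaped like the output of `get_corpus_predictions(topn=1)` above (one list of `(topic, confidence)` pairs per document) and one ground-truth label per document; the sample data is made up.

```python
from collections import Counter, defaultdict


def purity(predictions, ground_truth):
    """Fraction of documents carrying the majority label of their predicted topic."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    labels_by_topic = defaultdict(Counter)
    for doc_predictions, label in zip(predictions, ground_truth):
        top_topic = doc_predictions[0][0]  # topic id of the best prediction
        labels_by_topic[top_topic][label] += 1
    # For every topic, count only the documents with its most frequent label.
    majority_total = sum(max(counts.values()) for counts in labels_by_topic.values())
    return majority_total / len(ground_truth)


# Made-up predictions: topic 0 holds sport, sport, tech; topic 1 holds tech, sport.
res = [[(0, 0.9)], [(0, 0.8)], [(1, 0.7)], [(1, 0.6)], [(0, 0.5)]]
truth = ["sport", "sport", "tech", "sport", "tech"]
print(purity(res, truth))  # 0.6
```

Purity rewards topics dominated by a single label, so it tends to grow with the number of topics; this is one reason to read it alongside the other listed metrics.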
Install this package:

    pip install tomodapi

Use it in a Python script:

    from tomodapi import LdaModel

    # init the model
    m = LdaModel(model_path=path_location)
    # train on a corpus
    m.train(my_corpus, preprocessing=False, num_topics=10)
    # infer topic of a sentence
    best_topics = m.predict("In the time since the industrial revolution the climate has increasingly been affected by human activities that are causing global warming and climate change")
    topic, confidence = best_topics[0]
    # get top words for a given topic
    print(m.topic(topic))

If the `model_path` is not specified, the library will load/save the model from/under `models/<model_name>`.
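Because every model shares the same `train` and `predict` calls, comparing models can be reduced to a loop over instances. The sketch below illustrates the idea with a hypothetical stand-in class, `KeywordModel`, so that it runs without the library or any trained model; `KeywordModel`, `best_topic_per_model`, and the keyword sets are assumptions made for illustration and are not part of tomodapi.

```python
class KeywordModel:
    """Hypothetical stand-in mimicking the train/predict calls; not part of tomodapi."""

    def __init__(self, keywords_by_topic):
        self.keywords_by_topic = keywords_by_topic

    def train(self, data, num_topics=None, preprocessing=False):
        # This toy model has nothing to learn; it only mirrors the call shape.
        return 'success'

    def predict(self, text, topn=3):
        tokens = text.lower().split()
        scores = []
        for topic, keywords in enumerate(self.keywords_by_topic):
            hits = sum(token in keywords for token in tokens)
            scores.append((topic, hits / max(len(tokens), 1)))
        # Highest score first, as (topic, confidence) pairs.
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:topn]


def best_topic_per_model(models, text):
    """Run the same text through every model and keep each best prediction."""
    return {name: model.predict(text, topn=1)[0] for name, model in models.items()}


models = {
    "model_a": KeywordModel([{"climate", "warming"}, {"football", "goal"}]),
    "model_b": KeywordModel([{"goal", "match"}, {"climate", "emissions"}]),
}
for model in models.values():
    model.train([], num_topics=2, preprocessing=False)

result = best_topic_per_model(models, "Global warming changes the climate")
for name, (topic, confidence) in result.items():
    print(f"{name}: topic {topic} with confidence {confidence}")
```

With real models, the dictionary would hold instances such as the `LdaModel` shown above; topic ids are not comparable across models, so the top words of each predicted topic are what should be compared.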
A web API is provided for accessing the library as a service.

You should install 2 dependencies:
- mallet 2.0.8, to be placed in `tomodapi\`
- glove.6B.50d.txt, to be placed in `tomodapi\glove`

Under UNIX, you can use the `download_dep.sh` script.

    sh download_dep.sh
    python server.py

Alternatively, you can run a docker container with

    docker-compose -f docker-compose.yml up

The container uses mounted volumes so that you can easily update and access the computed models and the data files.

    docker build -t hyperted/topic .
    docker run -p 27020:5000 --env APP_BASE_PATH=http://hyperted.eurecom.fr/topic/api -d -v /home/semantic/hyperted/tomodapi/models:/models -v /home/semantic/hyperted/tomodapi/data:/data --name hyperted_topic hyperted/topic

    # Uninstall
    docker stop hyperted_topic
    docker rm hyperted_topic
    docker rmi hyperted/topic

If you find this library or API useful in your research, please consider citing our papers:
- Pasquale Lisena, Ismail Harrando, Oussama Kandakji and Raphaël Troncy. ToModAPI: A Topic Modeling API to Train, Use and Compare Topic Models. In 2nd Workshop for Natural Language Processing Open Source Software (NLP-OSS), November 19, 2020.
- Ismail Harrando, Pasquale Lisena and Raphaël Troncy. Apples to Apples: A Systematic Evaluation of Topic Models. In Recent Advances in Natural Language Processing (RANLP), September 2021.