Text Mining Using LDA with Context
Christoph Kling, Steffen Staab
Institute for Web Science and Technologies, University of Koblenz-Landau, Germany & Web and Internet Science Group, ECS, University of Southampton, UK
Text Mining Documents
- Documents are PDFs, emails, tweets, Flickr photo tags, CVs, ...
- Documents consist of a bag of words plus metadata: author(s), timestamp, geolocation, publisher, booktitle, device, ...
- (Illustration: example topics such as Chinese food: dimsum, duck, eggs ...; Vegan food: vegan, tofu ...; Breakfast: eggs, ham ...)
- Objective: cluster, categorize & explain
Latent Dirichlet Allocation (LDA)
- Document-topic distributions and topic-word distributions
- K topics, M documents; each document m ∈ {1, ..., M} has length N_m
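As a quick illustration of the generative process behind the plate notation above, here is a minimal sketch in Python/NumPy; K, M, V and the Dirichlet hyperparameters are illustrative values, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

K, M, V = 3, 5, 10          # topics, documents, vocabulary size (illustrative)
alpha, beta = 0.1, 0.01     # Dirichlet hyperparameters (illustrative)
doc_lengths = rng.integers(5, 15, size=M)         # N_m for each document

phi = rng.dirichlet(np.full(V, beta), size=K)     # topic-word distributions
theta = rng.dirichlet(np.full(K, alpha), size=M)  # document-topic distributions

corpus = []
for m in range(M):
    words = []
    for _ in range(doc_lengths[m]):
        z = rng.choice(K, p=theta[m])             # draw a topic for this word slot
        w = rng.choice(V, p=phi[z])               # draw a word from that topic
        words.append(w)
    corpus.append(words)
```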
Use Metadata to Help Topic Prediction
- Improve topic detection → morning timestamps may help to detect the breakfast topic
- Describe dependencies between metadata and topics → the breakfast topic occurs during morning hours
- Usage: autocompletion (from words to words), prediction of search queries (from metadata to words, and from words to metadata)
Structures of Metadata Spaces
- Nominal
- Ordinal
- Cyclic
- Spherical
- Networked
(Illustration: an example network with nodes Nejdl, Staab, Kling)
Challenges for Using Metadata for Text Mining
- Generalizing the text mining model: creating a special text mining model for every dataset and its kinds of metadata spaces is impractical → we need flexible models!
- Efficiency of the text mining model: rich metadata → complex models → complex inference and slow convergence of samplers → analysis of big datasets becomes impossible
- Explaining the result:
  - Importance of metadata → learn how to weight metadata → exclude irrelevant metadata (which also improves efficiency!)
  - Complex dependencies and complex probability functions → learned parameters become incomprehensible → reduced usefulness for data analysis / visualisation, no sanity checks on parameters
Topic Models for Arbitrary Metadata
Predict document-topic distributions from metadata (regression input: metadata; regression output: topic distribution):
→ Gaussian Process Regression Topic Model (Agovic & Banerjee, 2012)
→ Dirichlet-Multinomial Regression Topic Model (Mimno & McCallum, 2012) (see the sketch below)
→ Structural Topic Model, based on logistic normal regression (Roberts et al., 2013)
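For the Dirichlet-Multinomial Regression model referenced above, here is a minimal sketch of how metadata enters as a regression on the Dirichlet prior, assuming the standard DMR parameterization alpha_{d,k} = exp(x_d · lambda_k); the feature layout and weight values are illustrative.

```python
import numpy as np

def dmr_document_prior(x_d, lam):
    """Document-specific Dirichlet prior over topics from metadata features.

    x_d : feature vector of document d, shape (F,)
    lam : regression weights, one row per topic, shape (K, F)
    Returns alpha_d, shape (K,): exp(x_d . lambda_k) for each topic k.
    """
    return np.exp(lam @ x_d)

# Illustrative use: 2 features (bias + "morning" indicator), 3 topics
lam = np.array([[0.0,  1.5],    # topic 0 boosted for morning documents
                [0.0, -0.5],
                [0.0,  0.0]])
alpha_morning = dmr_document_prior(np.array([1.0, 1.0]), lam)
alpha_other   = dmr_document_prior(np.array([1.0, 0.0]), lam)
```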
(Illustrations: metadata mapped to document-topic distributions via Dirichlet-multinomial regression, Gaussian process regression, and logistic normal regression.)
Topic Models for Arbitrary Metadata: alternating inference (sketched below)
- Estimate topics
- Estimate the regression model
- Use the prediction to re-estimate the topics
- Re-estimate the regression model with the new topics
- ...
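The alternating scheme can be written as a generic loop; `estimate_topics` and `fit_regression` below are hypothetical placeholders for a concrete sampler and regression fitter, so this is a structural sketch rather than any particular model's inference.

```python
def alternating_inference(docs, metadata, estimate_topics, fit_regression, n_rounds=10):
    """Generic alternating inference loop (structural sketch).

    estimate_topics(docs, prior)         -> document-topic distributions
    fit_regression(metadata, doc_topics) -> fitted regression model with .predict()
    Both callables are hypothetical placeholders.
    """
    prior = None                                      # start without a metadata-informed prior
    doc_topics, model = None, None
    for _ in range(n_rounds):
        doc_topics = estimate_topics(docs, prior)     # 1) estimate topics given the current prior
        model = fit_regression(metadata, doc_topics)  # 2) re-estimate the regression model
        prior = model.predict(metadata)               # 3) its predictions become the new prior
    return doc_topics, model
```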
Topic Models for Arbitrary Metadata
- Applicable to a wide range of metadata!
- Estimation of the regression parameters is relatively expensive, and the alternating parameter estimation adds further cost
- Learned parameters have no natural interpretation
- Dirichlet-multinomial and logistic-normal regression do not support complex input data (e.g. geographical data, temporal cycles, ...)
- Gaussian process regression topic models are very powerful with the right kernel function, but require expert knowledge for kernel selection and efficient inference!
Hierarchical Multi-Dirichlet Process Topic Models: The Idea
(Illustration series: topic probability plotted over a metadata dimension such as time for documents such as emails, comparing topic prediction by Dirichlet-multinomial regression, Gaussian process regression, and cluster-based prediction, where each cluster of documents along the metadata axis gets its own topic distribution.)
Idea: a two-step model
1) Cluster similar documents
2) Learn topics for clusters and documents simultaneously:
   - learn the topic distributions of the document clusters
   - use the cluster-topic distributions for topic prediction
Performance, Complex Metadata
- Cluster documents separately for each metadata dimension (see the sketch below)
- Works for nominal, ordinal, cyclic and spherical data, and for any data which can be clustered!
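A sketch of the pre-clustering step, assuming scikit-learn's KMeans: the cyclic case embeds the hour of day on the unit circle so the distance respects the cycle, and nominal metadata simply groups by value. Function names and data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_hours(hours, n_clusters=4, seed=0):
    """Cluster a cyclic metadata dimension (hour of day) by embedding it on
    the unit circle, so that 23:00 and 01:00 end up in the same cluster."""
    angles = 2 * np.pi * np.asarray(hours) / 24.0
    points = np.column_stack([np.sin(angles), np.cos(angles)])
    return KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(points)

def cluster_nominal(values):
    """Nominal metadata (e.g. mailing-list ID): each distinct value is its own cluster."""
    ids = {v: i for i, v in enumerate(sorted(set(values)))}
    return np.array([ids[v] for v in values])

# Each metadata dimension yields its own clustering of the same six documents:
hour_clusters = cluster_hours([0, 1, 8, 9, 13, 23])
list_clusters = cluster_nominal(["lkml", "netdev", "lkml", "lkml", "netdev", "lkml"])
```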
Mixture of Metadata Predictions
- Metadata clusters are associated with topics (illustration: example cluster labeled German, Beer, Party)
- The topic prediction for a single document is a mixture of the predictions of its metadata clusters (sketched below)
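A sketch of that mixture: each metadata dimension contributes the topic distribution of the cluster the document falls into, and the contributions are mixed with per-metadata weights (these correspond to the metadata weighting discussed later; all numbers are illustrative).

```python
import numpy as np

def document_topic_prior(cluster_topic, doc_clusters, weights):
    """Mix the topic distributions of the clusters a document belongs to.

    cluster_topic : dict metadata_name -> array (n_clusters, K) of cluster-topic distributions
    doc_clusters  : dict metadata_name -> cluster index of this document
    weights       : dict metadata_name -> mixture weight (sums to 1 over metadata)
    """
    K = next(iter(cluster_topic.values())).shape[1]
    prior = np.zeros(K)
    for name, table in cluster_topic.items():
        prior += weights[name] * table[doc_clusters[name]]
    return prior

# Illustrative: two metadata dimensions, 3 topics
cluster_topic = {
    "hour": np.array([[0.7, 0.2, 0.1],    # morning cluster: breakfast topic likely
                      [0.2, 0.5, 0.3]]),
    "list": np.array([[0.3, 0.3, 0.4]]),
}
prior = document_topic_prior(cluster_topic,
                             doc_clusters={"hour": 0, "list": 0},
                             weights={"hour": 0.6, "list": 0.4})
```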
Smoothing in the HMDP
(Illustration: cluster-based prediction faces outliers and noisy data; topic probability plotted over a metadata dimension such as time)
Adjacency Smoothing
- Naive approach: the smoothed value of a cluster is the mean of the cluster and its adjacent clusters
- Repeat n times (sketched below)
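A sketch of this naive scheme for a chain of clusters (e.g. ordered time intervals): each round replaces every cluster's topic distribution by the mean of itself and its two neighbours, repeated n times; border clusters reuse themselves as the missing neighbour.

```python
import numpy as np

def smooth_chain(cluster_topic, n_rounds=2):
    """Naive smoothing on a 1-D chain of clusters.

    cluster_topic : array (C, K), one topic distribution per cluster,
                    with cluster c adjacent to c-1 and c+1.
    """
    theta = np.asarray(cluster_topic, dtype=float)
    for _ in range(n_rounds):
        padded = np.vstack([theta[:1], theta, theta[-1:]])   # repeat border clusters
        theta = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0
    return theta
```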
Smoothing topics associated with metadata clusters
- Documents receive topics from their own and from neighboring metadata clusters
- Neighborhoods can be defined for nominal, ordinal, cyclic, spherical and networked metadata spaces
- The smoothing strength is learned during inference: similar clusters → stronger smoothing, dissimilar clusters → weaker smoothing
- Alternatively, the smoothing strength can be predefined by the user
Metadata Weighting in HMDPs
- One variable η per metadata dimension governs the influence of its metadata clusters on the documents; if η falls below a threshold, the dimension is ignored
- The importance of each metadata dimension is learned during inference, answering the question: what fraction of the topics is explained by a given metadata dimension (e.g. time, geographical coordinates, ...)? → an interpretable parameter!
- Metadata with a low weight can be removed during inference (sketched below)
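A sketch of how such weights can be read and used, assuming the learned weights are available as nonnegative numbers per metadata dimension; the names, values and the threshold are illustrative.

```python
def prune_metadata(weights, eta=0.05):
    """Normalize metadata weights to fractions of explained topics and drop
    every dimension whose fraction falls below the threshold eta.

    weights : dict metadata_name -> nonnegative weight learned during inference
    Returns (fractions, kept_names).
    """
    total = sum(weights.values())
    fractions = {name: w / total for name, w in weights.items()}
    kept = [name for name, frac in fractions.items() if frac >= eta]
    return fractions, kept

fractions, kept = prune_metadata({"daily": 0.40, "weekly": 0.02,
                                  "yearly": 0.18, "list": 0.40})
# "weekly" falls below the threshold and would be ignored from here on
```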
Example Application
Dataset: Linux kernel mailing lists, 3,400,000 emails with timestamps and mailing-list IDs
Derived metadata dimensions (see the sketch below):
- Timeline
- Yearly cycle
- Weekly cycle
- Daily cycle
- Mailing list
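A sketch of how the listed metadata dimensions can be derived from an email timestamp; the field names are illustrative, and each resulting dimension would then be clustered as described earlier.

```python
from datetime import datetime, timezone

def time_features(ts):
    """Derive the metadata dimensions listed above from a Unix timestamp."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return {
        "timeline": ts,                        # position on the overall timeline
        "yearly":   dt.timetuple().tm_yday,    # day of year (yearly cycle)
        "weekly":   dt.weekday(),              # 0 = Monday (weekly cycle)
        "daily":    dt.hour + dt.minute / 60,  # hour of day (daily cycle)
    }

features = time_features(1356087600)  # an arbitrary example timestamp
```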
Topics (result figures)
- Professional topics vs. hobbyist topics
- Metadata weighting: metadata with a low weight can be removed during inference
Efficient Inference in the HMDP
Hierarchical Multi-Dirichlet Process Topic Model (HMDP)
(Model figure: metadata, cluster-topic distributions, document-topic distributions)
- Inference: nearly completely collapsed!
- We only need to learn the global topic distribution, the topic assignments to words, and the Dirichlet parameters
- Approximations: variational, practical, stochastic → low memory consumption → online inference
Parameters of the HMDP
- Cluster-topic distributions: how many documents of a cluster contain topic x?
- Metadata weights: how many of the documents' topics are explained by metadata x?
- Dirichlet process scaling parameters: how many pseudo-counts do we add to the topic distributions? (sketched below)
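A sketch of the pseudo-count reading of the scaling parameter, assuming the usual Dirichlet/Dirichlet-process smoothing form in which the parent level's topic distribution is added as `scale` pseudo-counts before normalizing; all values are illustrative.

```python
import numpy as np

def smoothed_topic_distribution(topic_counts, parent_distribution, scale=10.0):
    """Scaling parameter as pseudo-counts added to observed topic counts.

    topic_counts        : observed topic counts in a document or cluster, shape (K,)
    parent_distribution : topic distribution of the parent level, shape (K,), sums to 1
    scale               : scaling parameter = total number of pseudo-counts added
    """
    counts = np.asarray(topic_counts, dtype=float)
    pseudo = scale * np.asarray(parent_distribution, dtype=float)
    return (counts + pseudo) / (counts.sum() + scale)

dist = smoothed_topic_distribution([5, 0, 1], [0.2, 0.5, 0.3], scale=10.0)
```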
Properties of the HMDP
- Interpretable parameters
- Simultaneous inference of topics and metadata-topic dependencies
- Efficient online inference
Comparison of Topic Models for Arbitrary Metadata
Gaussian Process Topic Model, the “perfect” model:
- Can cope with arbitrary metadata
- Models dependencies between metadata
- Parameter learning is very expensive
- Kernel selection and inference require expert knowledge
- Parameters of Gaussian processes are hard to interpret
Dirichlet-Multinomial Regression Topic Model, the “straightforward” model:
- Can cope with many kinds of metadata
- Parameter learning is cheaper than for Gaussian processes but still expensive (due to alternating inference and repeated distance calculations)
- Cannot cope with complex metadata (e.g. geographical, cyclic, ...)
- Does not model dependencies between metadata
- Regression weights of Dirichlet-multinomial regression are hard to interpret
Hierarchical Multi-Dirichlet Process Topic Model, the “fast” model:
- Can cope with arbitrary metadata
- Fast inference (simultaneously for topics and topic predictions)
- All parameters have natural interpretations as probabilities or pseudo-counts
- Requires a (simple) pre-clustering of the documents
- Does not model dependencies between metadata
THANK YOU FOR YOUR ATTENTION!
