Cloud and Information Services Lab
Furong Huang, UC Irvine
Anima Anandkumar, UC Irvine
Nikos Karampatziakis, Microsoft CISL
Paul Mineiro + 𝜀, Microsoft CISL
Sergiy Matusevych, Microsoft CISL
Shravan Narayanamurthy, Microsoft CISL
Markus Weimer, Microsoft CISL
Apache REEF Contributors, Worldwide
/pos/cv107_24319.txt
is evil dead ii a bad movie ? it's full of terrible acting , pointless violence , and plot holes yet it remains a cult classic nearly fifteen years after its release ...
/pos/cv108_15571.txt
it's rather strange too have two computer animated talking ant movies come out in a single year , but that is what disney and pixar animation ; s latest film represents ...
Source: http://www.cs.cornell.edu/People/pabo/movie-review-data
LDAvis library for R https://github.com/cpsievert/LDAvis
\[ M_1 \triangleq \mathbb{E}[x_1], \qquad M_2 \triangleq \mathbb{E}[x_1 \otimes x_2] \]
\[ M_3 \triangleq \mathbb{E}[x_1 \otimes x_2 \otimes x_3] \]
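\[ M_1 \triangleq \mathbb{E}[x_1] \]
\[ M_2 \triangleq \mathbb{E}[x_1 \otimes x_2] - \frac{\alpha_0}{\alpha_0 + 1}\, M_1 \otimes M_1 \]
\[ M_3 \triangleq \mathbb{E}[x_1 \otimes x_2 \otimes x_3] - [\dots \text{more shift terms}] \]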
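To make the raw (unshifted) moments concrete, here is a minimal sketch of how M1 and M2 could be estimated from bag-of-words count vectors. It assumes a common estimator in which x1 and x2 are one-hot encodings of two distinct word positions in the same document, so each document contributes (c cᵀ − diag(c)) / (n(n − 1)) for a count vector c of length n ≥ 2. The class and method names are hypothetical and not taken from the talk's code.

```java
import java.util.Arrays;

/** Hypothetical helper (not from the talk's repository): empirical M1 and M2. */
final class EmpiricalMoments {

    /** M1[i] = average over documents of count[i] / documentLength. */
    static double[] firstMoment(final int[][] counts) {
        final int d = counts[0].length;
        final double[] m1 = new double[d];
        for (final int[] c : counts) {
            final double n = Arrays.stream(c).sum();
            for (int i = 0; i < d; i++) {
                m1[i] += c[i] / n / counts.length;
            }
        }
        return m1;
    }

    /**
     * M2[i][j] = average over documents of
     * (c[i] * c[j] - (i == j ? c[i] : 0)) / (n * (n - 1)).
     * Assumes every document has at least two words.
     */
    static double[][] secondMoment(final int[][] counts) {
        final int d = counts[0].length;
        final double[][] m2 = new double[d][d];
        for (final int[] c : counts) {
            final double n = Arrays.stream(c).sum();
            final double pairs = n * (n - 1); // ordered pairs of distinct word positions
            for (int i = 0; i < d; i++) {
                for (int j = 0; j < d; j++) {
                    final double co = (double) c[i] * c[j] - (i == j ? c[i] : 0);
                    m2[i][j] += co / pairs / counts.length;
                }
            }
        }
        return m2;
    }
}
```

With this estimator both M1 and M2 sum to 1 over all entries, which is a cheap sanity check before applying any shift terms.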
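\[ M_2 = \sum_{i=1}^{k} \alpha_i \cdot \beta_i \otimes \beta_i, \qquad M_3 = \sum_{i=1}^{k} \alpha_i \cdot \beta_i \otimes \beta_i \otimes \beta_i \]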
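\[ M_3 \triangleq \mathbb{E}[x_1 \otimes x_2 \otimes x_3] \]
\[ M_3 = \lambda_1\, a_1 \otimes b_1 \otimes c_1 + \lambda_2\, a_2 \otimes b_2 \otimes c_2 + \cdots = \sum_i \lambda_i \cdot a_i \otimes b_i \otimes c_i \]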
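\[ \lambda, A \leftarrow \operatorname*{argmin}_{\lambda \in \mathbb{R}^{k},\; A \in \mathbb{R}^{k \times k}} \left\| A \cdot \operatorname{Diag}(\lambda) \cdot (C \odot B)^{\top} - M_3 \right\|^{2} \]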
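For fixed B and C, the minimization over A and λ is an ordinary linear least-squares problem with a closed-form solution. The sketch below assumes that M_3 in the objective stands for the mode-1 unfolding of the tensor (a k × k² matrix), that the norm is the Frobenius norm, that ⊙ is the Khatri-Rao (column-wise Kronecker) product, and that ∗ denotes the element-wise (Hadamard) product. Splitting the result into unit-norm columns and weights is a common convention, not necessarily the one used in the talk.

```latex
\[
\tilde{A} \;\leftarrow\; M_3 \,\bigl((C \odot B)^{\top}\bigr)^{\dagger}
        \;=\; M_3\,(C \odot B)\,\bigl(C^{\top} C \ast B^{\top} B\bigr)^{\dagger}
\]
\[
\lambda_i \;\leftarrow\; \lVert \tilde{a}_i \rVert_2,
\qquad
a_i \;\leftarrow\; \tilde{a}_i / \lambda_i
\qquad (i = 1, \dots, k)
\]
```

The second equality uses the identity (C ⊙ B)ᵀ(C ⊙ B) = CᵀC ∗ BᵀB, so only a k × k matrix has to be (pseudo-)inverted. The updates for B and C follow by symmetry with the corresponding unfoldings.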
http://reef.incubator.apache.org
[Figure: where REEF sits in the big data stack]
• Applications
• Programming Models (Domain Specific Languages)
• REEF, "The Application Server for Big Data": Communications, Storage, Fault Management, Interoperability
  • Operator Layer (Future Work)
  • REEF Operator API and Library
  • REEF Logical Abstraction
• Resource Manager (Focus: YARN)
• Storage (Focus: HDFS)
Component labels in the figure: HDFS, Azure Block Storage, Office 365, SQL / HIVE / LINQ, Cloud Numerics, Pregel, GraphLab, Datalab, Machine Learning, BI, Power*, YARN, Mesos, Azure Tasks, Drawbridge.
Easy to reason about
• Centralized control flow
  • Evaluator allocation and configuration
  • Task configuration and submission
• Centralized error handling
  • Task exceptions thrown to the Driver
  • Evaluator failures reported to the Driver
Scalable
• Event-based programming
  • Driver sends requests as events to REEF
  • REEF sends events to the Driver
• Mostly stateless design
  • REEF maintains minimal state
  • Majority of state keeping (e.g. work queues) is maintained by the Driver
    // Submit task to the newly created context
    public class ContextActiveHandler implements EventHandler<ActiveContext> {
      @Override
      public void onNext(final ActiveContext context) {
        taskGroups.submitNext(context);
      }
    }

    // Submit next task to current context
    public class TaskCompletedHandler implements EventHandler<CompletedTask> {
      @Override
      public void onNext(final CompletedTask task) {
        final ActiveContext context = task.getActiveContext();
        taskGroups.submitNext(context);
      }
    }
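The handlers above delegate to a `taskGroups` object that the slide does not show. As a rough illustration of the "state lives in the Driver" idea, here is a self-contained sketch of what such a driver-owned work queue might look like. Everything in it is a stand-in: `EventHandler`, `FakeContext`, `TaskQueue` and `run` are hypothetical types written for this example and are not the Apache REEF interfaces.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Illustration only: stand-in types, NOT the Apache REEF API. */
final class DriverQueueSketch {

    /** Stand-in for an event handler interface. */
    interface EventHandler<T> {
        void onNext(T event);
    }

    /** Hypothetical context that simply records the tasks submitted to it. */
    static final class FakeContext {
        final List<String> submitted = new ArrayList<>();

        void submitTask(final String taskName) {
            submitted.add(taskName);
        }
    }

    /** Hypothetical analogue of the slide's taskGroups: pending work kept by the driver. */
    static final class TaskQueue {
        private final Queue<String> pending = new ArrayDeque<>();

        TaskQueue(final List<String> tasks) {
            pending.addAll(tasks);
        }

        /** Hand the next pending task to the context; returns false when none is left. */
        synchronized boolean submitNext(final FakeContext context) {
            final String next = pending.poll();
            if (next == null) {
                return false;
            }
            context.submitTask(next);
            return true;
        }
    }

    /** Drive the queue on one context, mimicking the two handlers from the slide. */
    static List<String> run(final List<String> tasks) {
        final TaskQueue queue = new TaskQueue(tasks);
        final FakeContext context = new FakeContext();
        final EventHandler<FakeContext> onContextActive = queue::submitNext;
        final EventHandler<FakeContext> onTaskCompleted = queue::submitNext;

        onContextActive.onNext(context); // first task goes to the new context
        int completed = 0;
        // In this simulation every submitted task "completes" immediately.
        while (completed < context.submitted.size()) {
            completed++;
            onTaskCompleted.onNext(context); // completion triggers the next submission
        }
        return context.submitted;
    }
}
```

Because the queue is the only mutable state, the handlers stay one-liners, which is the property the slide is highlighting.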
    @Inject
    public WhitenTask(
        final @Parameter(TaskConfigurationOptions.Identifier.class) String taskId,
        final @Parameter(Launch.DimD.class) int dimD,
        final @Parameter(Launch.DimK.class) int dimK,
        final GroupCommClient groupCommClient,
        final InputData data,
        final TaskEnvironment env) {
      // ...
    }

Use Java "type system" to validate the configuration
    // We can send and receive any Java serializable data, e.g. JBLAS matrices
    private final Broadcast.Sender<DoubleMatrix> modelSender;
    private final Broadcast.Receiver<DoubleMatrix[]> resultReceiver;

    // Broadcast the model, collect the results, repeat.
    do {
      this.modelSender.send(sliceA);
      // ...
      final DoubleMatrix[] result = this.resultReceiver.reduce();
    } while (notConverged(sliceA, prevSliceA));
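According to the speaker notes, group communication forms a tree along which nodes pass data, and the reduce stage takes an aggregation operator. A minimal plain-Java sketch of that idea follows. It is not the REEF group communication API: `TreeReduceSketch`, `treeReduce` and `add` are hypothetical names, and the "workers" are just list entries combined pairwise, level by level.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

/** Illustration only: pairwise (tree-shaped) reduction with a pluggable operator. */
final class TreeReduceSketch {

    /** Combine per-worker values level by level; op should be associative. */
    static <T> T treeReduce(final List<T> parts, final BinaryOperator<T> op) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("nothing to reduce");
        }
        List<T> level = new ArrayList<>(parts);
        while (level.size() > 1) {
            final List<T> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                // A node without a sibling is passed up unchanged.
                next.add(i + 1 < level.size()
                    ? op.apply(level.get(i), level.get(i + 1))
                    : level.get(i));
            }
            level = next;
        }
        return level.get(0);
    }

    /** Example aggregation operator: element-wise sum of equally sized vectors. */
    static double[] add(final double[] a, final double[] b) {
        final double[] sum = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            sum[i] = a[i] + b[i];
        }
        return sum;
    }
}
```

With n workers the tree has about log2(n) levels, so the master receives a single aggregated value instead of n separate messages.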
https://github.com/Microsoft-CISL/TensorFactorization http://reef.incubator.apache.org
motus@apache.org
\[ M_2 = \sum_i \lambda_i \cdot u_i \otimes v_i = \lambda_1 \cdot u_1 \otimes v_1 + \lambda_2 \cdot u_2 \otimes v_2 + \cdots \]
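\[ M_3 \triangleq \mathbb{E}[x_1 \otimes x_2 \otimes x_3] \]
\[ M_3 = \lambda_1\, u_1 \otimes v_1 \otimes w_1 + \lambda_2\, u_2 \otimes v_2 \otimes w_2 + \cdots = \sum_i \lambda_i \cdot u_i \otimes v_i \otimes w_i \]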
• Find whitening matrix $W$ s.t. the whitened vectors $W^{\top} a_i$ are orthogonal
• Use $M_2$ to find $W$ s.t. $W^{\top} M_2 W = I$
• Whiten $M_3$: $M_3(W, W, W)$
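One standard way to carry out the whitening steps, written out as a short derivation. It assumes the symmetric forms M_2 = Σ α_i β_i ⊗ β_i and M_3 = Σ α_i β_i ⊗ β_i ⊗ β_i given earlier, with M_2 of rank k. The symbols W, U, Σ, v_i and T are notation introduced here, not necessarily the notation of the original slide.

```latex
\[
M_2 = U \Sigma U^{\top} \ \text{(rank-$k$ eigendecomposition)},
\qquad
W \triangleq U \Sigma^{-1/2}
\;\Longrightarrow\;
W^{\top} M_2 W = \Sigma^{-1/2} U^{\top} U \Sigma U^{\top} U \Sigma^{-1/2} = I_k
\]
\[
v_i \triangleq \sqrt{\alpha_i}\, W^{\top} \beta_i
\;\Longrightarrow\;
\sum_{i=1}^{k} v_i v_i^{\top} = W^{\top} M_2 W = I_k
\quad\text{so the } v_i \text{ are orthonormal}
\]
\[
T \triangleq M_3(W, W, W)
  = \sum_{i=1}^{k} \alpha_i \,(W^{\top}\beta_i)^{\otimes 3}
  = \sum_{i=1}^{k} \alpha_i^{-1/2}\, v_i^{\otimes 3}
\]
```

After whitening, T is only k × k × k and has an orthogonal decomposition, which is why the factor matrices in the factorization step can be taken as k × k.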

Topic Modeling via Tensor Factorization - Use Case for Apache REEF

Editor's Notes

  • #3 We are hiring!
  • #4 What problem we are solving, why it is important, and what the state-of-the-art solutions are. Then the new approach, our algorithm, etc.
  • #5 In general, given data (e.g. corpus of text, social graph, user pageview/click logs), reveal latent parameters that influence the distribution – communities, user preferences, text topics. We’ll talk about text because it’s easy to demo and reason about even on a small dataset
  • #6 Top 10 topics. Each document has a mixture of topics; some topics are common, e.g. film/movie/time. Words appear in many topics, e.g. action/crime/cop and action/Jackie Chan. Topics are sparse.
  • #8 Start 3:20
  • #9 It’s all bag of words to me. Nikolai Ge, Portrait of Leo Tolstoy, 1884, Tretyakov Gallery, Moscow. Writing what I believe.
  • #10 Start 10
  • #11 Introduced by Karl Pearson in 1894; everything new is well-forgotten old. M1 is a vector, M2 a matrix; M2 is not enough for topics (there is spectral clustering; will talk about it later if asked). Need to capture triplets: a cube of data…
  • #13 It was shown that, with these shift terms, M1..M3 are sufficient to reveal not only clusters but also mixtures of latent parameters. In fact, if you squint right, M2 is a covariance matrix, and α0 is a Dirichlet hyperprior. Similarly, M3 is skewness (shifted). I will give more details later. So this is the information that we collect. How do we get the topics?
  • #15 8:25 We can factorize the tensor into a sum of outer products of eigenvectors that reveal the topics, i.e. each vector beta_i contains the probabilities of words in topic i.
  • #19 It’s linear. Need a resource manager, e.g. YARN, and a distributed FS. Master node checks for convergence.
  • #20 Markus gave a talk at Hadoop Summit 2014 – see on YouTube
  • #21 Much nicer in C#. REEF itself has very little state; all state is in the Driver.
  • #22 18:00
  • #25 Centralized error handling: mention Erlang/OTP supervisor architecture
  • #28 Java “type system”… Annotate constructor with @Inject, mark leaf parameters with @Parameter, other params must be classes with @Inject
  • #34 Form a communication tree; nodes pass data along. On the reduce stage we also specify the aggregation operator.
  • #35 Future work: community detection, larger datasets (PubMed), comparison with LightLDA. In general: need better support for tensors (libraries, CUDA, parameter server).
  • #37 End: 20 min sharp. Total: ~24 min with questions.
  • #38 Model (LDA) is independent from inference algorithms (variational Bayes, MCMC, tensors)