Algorithmic Data Science = Theory + Practice
Matteo Riondato – Labs, Two Sigma Investments
@teorionda – http://matteo.rionda.to
IEEE MIT URTC – November 5, 2016
Matteo Riondato
Ph.D. in CS;
Working at Labs, Two Sigma Investments (Research Scientist), and CS Dept., Brown U. (Visiting Asst. Prof.);
Doing research on algorithmic data science;
Tweeting @teorionda;
Reading matteo@twosigma.com;
“Living” at http://matteo.rionda.to.
Conjecture: Let X be a scientific discipline. Then

  21st-century X = datascience(X) + ε.

Partial evidence: “Computational X” exists for many X.
data science : 21st century = statistics : 20th century
data science for 21st-century society: questions and data
data science
data science =
  1/4 data representation and management
  1/4 mathematical and statistical modeling
  1/4 computational thinking and algorithms
  1/4 domain expertise
Shake well, and strain into a cocktail glass.
(Venn diagram: domain expertise, modeling, management, algorithms)
algorithmic data science = algorithms for/with:
  approximation guarantees
  data streams
  Spark/MapReduce
  sampling
  statistical testing
  graph analysis
  ...
algorithmic data science = theory
algorithmic data science ≈ theory + practice
algorithmic data science = (theory × practice)^(theory × practice)
Example
Scientific question: find relevant webpages on the web, influential participants in an email chain, key proteins in a protein network, ...

Data representation: represent the data as a graph G = (V, E).

(figure: example graph on nodes a–h)

Modeling question: what are the important nodes in a graph G = (V, E)? We need f : V → R+ to express the importance of a node: the higher f(x), the more important x ∈ V.
Domain Knowledge / Modeling: assume that 1) every node wants to communicate with every other node; and 2) communication progresses along Shortest Paths (SPs). Then, the higher the number of SPs that a node v belongs to, the more important v is.

Definition: for each node x ∈ V, the betweenness b(x) of x is

  b(x) = (1 / (n(n−1))) · Σ_{u ≠ x ≠ v ∈ V} σ_uv(x) / σ_uv ∈ [0, 1]

where σ_uv is the number of SPs from u to v, for u, v ∈ V, and σ_uv(x) is the number of SPs from u to v that go through x. I.e., b(x) is the weighted fraction of SPs that go through x, among all SPs in G.
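As a quick sanity check, the definition corresponds (up to normalization convention) to what off-the-shelf tools compute. A minimal sketch using networkx; the edge list is made up for illustration, since the slides' toy graph is not fully specified in the text:

```python
import networkx as nx

# Hypothetical edge list: the toy graph on nodes a-h from the figure
# is not recoverable from the text, so these edges are illustrative only.
G = nx.Graph([("a", "b"), ("b", "c"), ("b", "g"), ("c", "g"),
              ("d", "e"), ("d", "g"), ("e", "f"), ("f", "g"), ("g", "h")])

# networkx divides by (n-1)(n-2)/2 for undirected graphs, a different
# constant than the slide's n(n-1); the ranking of nodes is the same.
bc = nx.betweenness_centrality(G, normalized=True)
print(sorted(bc.items(), key=lambda kv: -kv[1]))
```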
(figure: the example graph on nodes a–h)

Node x :  a      b      c      d      e      f      g      h
b(x)   :  0      0.250  0.125  0.036  0.054  0.080  0.268  0
Algorithmic question: How to compute all b(x)?

Brandes’ Algorithm. Intuition: for each vertex s ∈ V:
1) Build the SP DAG from s via Dijkstra/BFS;
2) Traverse the SP DAG from the most distant node towards s, in reverse order of distance. During the walk, appropriately increment b(v) for each non-leaf node v traversed.

(figure: the SP DAG from a source s, nodes labeled by distance; updates to b(v) not shown)

Time complexity: O(nm + n² log n): n Dijkstra runs, plus n backward walks over the SP DAGs. Too much even with just 10⁴ nodes.
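To make the two phases concrete, here is a minimal sketch of Brandes' algorithm for unweighted graphs (BFS; Dijkstra would replace it for weighted graphs), normalized as in the definition above. The adjacency-dict input format is an assumption of the sketch:

```python
from collections import deque

def single_source_dependencies(adj, s):
    """One iteration of Brandes' algorithm on an unweighted graph:
    returns delta[v] = sum over targets t of sigma_st(v) / sigma_st."""
    sigma = {v: 0 for v in adj}      # number of SPs from s to v
    dist = {v: -1 for v in adj}
    preds = {v: [] for v in adj}     # predecessors in the SP DAG
    sigma[s], dist[s] = 1, 0
    order, queue = [], deque([s])
    while queue:                     # phase 1: BFS builds the SP DAG
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if dist[w] < 0:          # first time we reach w
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:   # v lies on a SP from s to w
                sigma[w] += sigma[v]
                preds[w].append(v)
    delta = {v: 0.0 for v in adj}
    for w in reversed(order):        # phase 2: backward accumulation
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
    return delta

def brandes_betweenness(adj):
    """Exact b(x) for all x, with the 1/(n(n-1)) normalization above."""
    n = len(adj)
    b = {v: 0.0 for v in adj}
    for s in adj:
        delta = single_source_dependencies(adj, s)
        for v in adj:
            if v != s:
                b[v] += delta[v]
    return {v: b[v] / (n * (n - 1)) for v in b}
```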
Modeling / Domain knowledge: high-quality approximations of all BCs are sufficient.

Let ε ∈ (0, 1) and δ ∈ (0, 1) be user-specified parameters. An (ε, δ)-approximation is a set {b̃(x) : x ∈ V} of n values s.t.

  Pr(∃x ∈ V s.t. |b̃(x) − b(x)| > ε) ≤ δ,

i.e., with prob. ≥ 1 − δ, for all x ∈ V, b̃(x) is within ε of b(x): a uniform probabilistic guarantee over all the estimates.
Algorithmic question: How to obtain an (ε, δ)-approximation quickly?

Answer: Sampling. Instead of computing all the SPs from each node x ∈ V, compute them only from some randomly chosen nodes (samples).

Theory question: How many samples do we need to obtain an (ε, δ)-approximation? The more the better, but really, how many?
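A minimal sketch of the source-sampling estimator this suggests, reusing single_source_dependencies from the Brandes sketch above. This is one simple sampling scheme among several in the literature; dividing by k(n−1) makes the estimate unbiased for the b(x) defined earlier:

```python
import random

def sampled_betweenness(adj, k, seed=42):
    """Estimate b(x) for all x from k sources drawn uniformly at
    random (with replacement), running only the single-source phase
    of Brandes' algorithm for each sampled source."""
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    b = {v: 0.0 for v in adj}
    for _ in range(k):
        s = rng.choice(nodes)
        delta = single_source_dependencies(adj, s)
        for v in nodes:
            if v != s:
                b[v] += delta[v]
    # E[delta_s(v)] over a uniform source s equals (n-1) * b(v), so
    # dividing by k * (n-1) yields an unbiased estimate of b(v).
    return {v: b[v] / (k * (n - 1)) for v in b}
```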
How many samples do we need to obtain an (ε, δ)-approximation?

Theory: Hoeffding bound + union bound. Need O((1/ε²) · (log |V| + log(1/δ))) samples.

Comments
Practice: fewer samples than the above are sufficient for an (ε, δ)-approximation.
Theory: the dependency on |V|, and not on the edge structure, seems wrong.
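For concreteness, one instantiation of the Hoeffding + union bound argument: each per-sample estimate lies in [0, 1] (as in the source-sampling sketch above), so it suffices that 2n·exp(−2kε²) ≤ δ. The O(·) on the slide hides exactly these constants:

```python
import math

def hoeffding_union_samples(n, eps, delta):
    """Samples sufficient for an (eps, delta)-approximation via
    Hoeffding (per-sample estimates in [0, 1]) plus a union bound
    over n nodes: solve 2*n*exp(-2*k*eps**2) <= delta for k."""
    return math.ceil(math.log(2 * n / delta) / (2 * eps**2))

# e.g. hoeffding_union_samples(10**6, 0.01, 0.1) is about 8.4e4
```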
How many samples do we need to obtain an (ε, δ)-approximation?

Theory: Vapnik–Chervonenkis (VC) dimension. Developed to evaluate supervised learning classifiers; we twisted it to work in an unsupervised graph-mining problem. “The most practical theory ever” – Me, right now.

Need O((1/ε²) · (log diam(G) + log(1/δ))) samples: an exponential decrease in sample size on small-world networks.

Comments
Practice: a great improvement, but still too many samples.
Theory: graphs with the same diameter are not equally “hard”.
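A sketch of the resulting sample-size computation. As an assumption, the constants here follow the published VC-dimension analysis (Riondato–Kornaropoulos), where diam is the vertex-diameter (number of nodes on a longest SP) and c is a universal constant, with c ≈ 0.5 commonly used in practice:

```python
import math

def vc_samples(diam, eps, delta, c=0.5):
    """Sample size from the VC-dimension bound: the VC dimension of
    the relevant range space is at most floor(log2(diam - 2)) + 1,
    so (c / eps**2) * (VC + ln(1/delta)) samples suffice."""
    vc = math.floor(math.log2(max(diam - 2, 1))) + 1
    return math.ceil((c / eps**2) * (vc + math.log(1 / delta)))

# Compare with the Hoeffding bound: for a small-world graph with
# diam ~ 20, floor(log2(18)) + 1 = 5 replaces ln(2*10**6/0.1) ~ 16.8.
```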
How many samples do we need to obtain an (ε, δ)-approximation?

Theory: Progressive sampling + Rademacher averages. Let’s start sampling, and use the sample itself to decide when to stop. Stop at iteration i when η_i ≤ ε, where

  η_i = 2 min_{t ∈ R+} (1/t) ln Σ_{(r,C) ∈ T} e^(t²r²/(2S_i²)) + 3 √( (i+1) ln(2/δ) / (2S_i) )

Comments
Practice: getting closer to the empirical bound.
Theory: proving stuff is getting complicated (isn’t that good?)
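A schematic of the progressive-sampling loop, reusing single_source_dependencies from the Brandes sketch. The Rademacher-average bound is abstracted behind an eta_bound callable, since computing η_i exactly as above is where the real work is; both the sampling schedule and eta_bound are left as inputs:

```python
import random

def progressive_sampling(adj, eps, schedule, eta_bound, seed=0):
    """Grow the sample following `schedule` (sizes S_1 < S_2 < ...)
    and stop as soon as the data-dependent bound eta_i <= eps.
    `eta_bound(sample, totals, i)` stands in for the Rademacher-
    average computation on the slide."""
    rng = random.Random(seed)
    nodes = list(adj)
    n = len(nodes)
    sample, totals = [], {v: 0.0 for v in adj}
    for i, size in enumerate(schedule):
        while len(sample) < size:                # enlarge the sample
            s = rng.choice(nodes)
            sample.append(s)
            delta = single_source_dependencies(adj, s)
            for v in nodes:
                if v != s:
                    totals[v] += delta[v]
        if eta_bound(sample, totals, i) <= eps:  # stopping condition
            k = len(sample)
            return {v: totals[v] / (k * (n - 1)) for v in totals}
    raise RuntimeError("schedule exhausted before eta_i <= eps")
```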
Theory + Practice: get rid of “theoretical elegance” while maintaining correctness.

Let

  g_S(x, y) = 2 exp(−2x²(y − 2R_F(S))²) + exp( −((1−x)y + 2x·R_F(S)) · φ( 2R_F(S) / ((1−x)y + 2x·R_F(S)) − 1 ) ).

Then compute

  min_{x,ξ} ξ   s.t.   g_S(x, ξ) ≤ η,   ξ ∈ (2R_F(S), 1],   x ∈ (0, 1),

and check whether ξ < ε.
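Since this minimization has no closed form, a numerical sketch: grid search over x with bisection over ξ, treating g_S as a black-box function. It assumes g_S is non-increasing in ξ, as is the case for such tail bounds (the probability of a larger deviation is smaller):

```python
import numpy as np

def min_feasible_xi(g, R, eta, grid=200, tol=1e-6):
    """Numerically solve  min_{x, xi} xi  s.t.  g(x, xi) <= eta,
    with xi in (2R, 1] and x in (0, 1): grid search over x,
    bisection over xi.  Assumes g is non-increasing in xi."""
    best = None
    for x in np.linspace(1e-3, 1.0 - 1e-3, grid):
        lo, hi = 2.0 * R + tol, 1.0
        if g(x, hi) > eta:          # no feasible xi for this x
            continue
        while hi - lo > tol:        # smallest feasible xi, by bisection
            mid = 0.5 * (lo + hi)
            if g(x, mid) <= eta:
                hi = mid
            else:
                lo = mid
        best = hi if best is None else min(best, hi)
    return best                     # caller then checks best < eps
```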
To be a data scientist, you need to get your hands dirty in data. To be an algorithmic data scientist, you need to get your hands dirty in data and theory.
Other examples:
  pattern mining (Rademacher Averages)
  selectivity of database queries (VC-dimension)
  triangle counting from data streams (non-i.i.d. sampling)
  graph summarization (Szemerédi Regularity)
1) Embrace data science. 2) Combine theory and practice.

Thank you!
EML: matteo@twosigma.com
TWTR: @teorionda
WWW: http://matteo.rionda.to
This document is being distributed for informational and educational purposes only and is not an offer to sell or the solicitation of an offer to buy any securities or other instruments. The information contained herein is not intended to provide, and should not be relied upon for, investment advice. The views expressed herein are not necessarily the views of Two Sigma Investments, LP or any of its affiliates (collectively, “Two Sigma”). Such views reflect significant assumptions and subjective views of the author(s) of the document and are subject to change without notice. The document may employ data derived from third-party sources. No representation is made as to the accuracy of such information and the use of such information in no way implies an endorsement of the source of such information or its validity. The copyrights and/or trademarks in some of the images, logos or other material used herein may be owned by entities other than Two Sigma. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa.