Community Extracting using Intersection Graph and Content Analysis in Complex Network Toshiya Kuramochi Naoki Okada Kyohei Tanikawa Yoshinori Hijikata Shogo Nishida Graduate School of Engineering Science, Osaka University, Japan The 2012 IEEE/WIC/ACM International Conference on Web Intelligence ENGINEERING SCIENCE
Overview page 2 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Results and Discussions 5. Conclusion
Background page 3 Many researchers have studied about complex networks and have found the “community structure” Community structure • connection in groups is densely • connection among groups is sparsely science communities in WWW sets of web pages related to a certain topic business sport Community structure is a key characteristic of complex network
Problem (1) – overlap of communities page 4 Some nodes belong to several communities in real networks communities in WWW community of community of sports pages business pages overlap of communities (e.g., economic effect of the Olympic Games) Most of ordinary clustering methods allocate nodes one cluster They CANNOT represent the overlap of communities Community detection method should be able to allocate nodes several clusters
Problem (2) – edge inhomogeneity page 5 Edges are not homogeneous in real networks edges in SNS network same hobby work place family same university Many community detection methods assume all edges are same They CANNOT represent the edge inhomogeneity Weights of edges should be set individually
Problem (3) – appropriate number of communities page 6 The number of real communities is often unknown How many communities in this network? 4 3 2 Most hierarchical clustering methods require manual input of appropriate number of communities Number of communities should be determined automatically
Purpose of this work page 7 We solve these three problems by proposing a new community detection method • A node may belong to several communities Using the idea of intersection graph [Everett & Borgatti, 1998] • Weights of edges are set individually Content information analysis • Number of communities are automatically determined Clustering based on modularity [Newman, 2003]
Overview page 8 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Results and Discussions 5. Conclusion
Summary of our proposed method page 9 Input • Step 1: Enumeration of dense subgraphs • Step 2: Conversion to the intersection graph • Step 3: Calculation of the weights of edges • Step 4: Clustering based on modularity Output Clusters (communities)
Step 1: Enumeration of dense subgraphs page 10 dense subgraph clique, n-clique, n-clan, etc. complete graph example of clique enumeration (example of dense subgraph) { D, E, I, J } (size = 4) K I H clique threshold: threshold J of size of clique B G threshold enumerate D 3 A E 4 C F 5 { B, C, D, J, K } (size = 5) E, G, H } (size = 3) {
Step 2: Conversion to the intersection graph page 11 Z K I H E, G, J B H G {E} D A B, C, D D, E, E C F J, K I, J X { D, J } Y (common member) overlap threshold: threshold of number of common members
Step 3: Calculation of weights of edges page 12 X B C D Y K J E I
Step 4: Clustering based on modularity page 13 automatically detection of best number of clusters
Summary of our proposed method page 14 K I H J B • Step 1: Enumeration G of cliques D A E C F • Step 2: Conversion to Z X the intersection graph Y • Step 3: Calculation of Z weights of edges X Y • Step 4: Clustering Z cluster 2 based on modularity cluster 1 X Y
Overview page 15 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Experimental Results 5. Conclusion
Dataset page 16 mixi: one of the most popular SNS in Japan • test subjects: 20 mixi users • link structure: two radius from each test subject • content information: self-introduction, friend introduction, attributes (gender, address, birthday, etc.) ground truth: relation names between a test subject and person in the dataset which is enumerated by the test subject (e.g., „same university‟, „hobby friend‟, „coworker‟) evaluation: we evaluate communities that include test subject • numerical evaluation • precision • recall • F-measure • visual evaluation
Implementation page 17 parameter setting: (clique threshold, overlap threshold) = (3, 2), (4, 2), (4, 3), (5, 2), (5, 3) and (5, 4) implementation: conventional method WithCA NonCA Everett‟s Everett‟s method method* content friend introduction analysis none analysis clustering simple hierarchical clustering based on modularity method clustering # output equals to automatically determined correct data* clusters NonCA * the number of relation names which is enumerated by test subject
Overview page 18 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Results and Discussions 5. Conclusion
Numerical evaluation (F-measure) page 19 Everett's method Everett's method* NonCA WithCA 0.5 0.4 F-measure 0.3 0.2 0.1 0 (3, 2) (4, 2) (4, 3) (5, 2) (5, 3) (5, 4) (clique threshold, overlap threshold) Everett Everett* NonCA WithCA Everett bad bad bad superiority of our method Everett* good bad bad NonCA good good contribution of clustering WithCA good good based on modularity
Numerical evaluation (precision) page 20 Everett's method Everett's method* NonCA WithCA 0.8 0.7 0.6 precision 0.5 0.4 0.3 0.2 0.1 0 (3, 2) (4, 2) (4, 3) (5, 2) (5, 3) (5, 4) (clique threshold, overlap threshold) Everett Everett* NonCA WithCA Everett bad Everett* bad NonCA bad the precision become WithCA good good good better with content analysis
Numerical evaluation (recall) page 21 Everett's method Everett's method* NonCA WithCA 0.6 0.5 0.4 recall 0.3 0.2 0.1 0 (3, 2) (4, 2) (4, 3) (5, 2) (5, 3) (5, 4) (clique threshold, overlap threshold) Everett Everett* NonCA WithCA our methods overcome Everett bad bad bad the ordinary method Everett* good bad bad the recall become better by NonCA good good good using clustering method WithCA good good bad based on modularity
Visual evaluation (Everett’s method vs. WithCA) page 22 Everett‟s method WithCA
Visual evaluation (NonCA vs. WithCA) page 23 NonCA WithCA
Overview page 24 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Results and Discussions 5. Conclusion
Conclusion page 25 • Features of our proposed method – Our method can allocate nodes several clusters – Our method can represent edge inhomogeneity – Our method can automatically detect the number of clusters • Evaluation on real SNS networks – Our method overcomes conventional method in F- measure – The recall becomes better by using clustering method based on modularity – The precision becomes better with content analysis

Community Extracting Using Intersection Graph and Content Analysis in Complex Network

  • 1.
    Community Extracting usingIntersection Graph and Content Analysis in Complex Network Toshiya Kuramochi Naoki Okada Kyohei Tanikawa Yoshinori Hijikata Shogo Nishida Graduate School of Engineering Science, Osaka University, Japan The 2012 IEEE/WIC/ACM International Conference on Web Intelligence ENGINEERING SCIENCE
  • 2.
    Overview page 2 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Results and Discussions 5. Conclusion
  • 3.
    Background page 3 Many researchers have studied about complex networks and have found the “community structure” Community structure • connection in groups is densely • connection among groups is sparsely science communities in WWW sets of web pages related to a certain topic business sport Community structure is a key characteristic of complex network
  • 4.
    Problem (1) –overlap of communities page 4 Some nodes belong to several communities in real networks communities in WWW community of community of sports pages business pages overlap of communities (e.g., economic effect of the Olympic Games) Most of ordinary clustering methods allocate nodes one cluster They CANNOT represent the overlap of communities Community detection method should be able to allocate nodes several clusters
  • 5.
    Problem (2) –edge inhomogeneity page 5 Edges are not homogeneous in real networks edges in SNS network same hobby work place family same university Many community detection methods assume all edges are same They CANNOT represent the edge inhomogeneity Weights of edges should be set individually
  • 6.
    Problem (3) –appropriate number of communities page 6 The number of real communities is often unknown How many communities in this network? 4 3 2 Most hierarchical clustering methods require manual input of appropriate number of communities Number of communities should be determined automatically
  • 7.
    Purpose of thiswork page 7 We solve these three problems by proposing a new community detection method • A node may belong to several communities Using the idea of intersection graph [Everett & Borgatti, 1998] • Weights of edges are set individually Content information analysis • Number of communities are automatically determined Clustering based on modularity [Newman, 2003]
  • 8.
    Overview page 8 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Results and Discussions 5. Conclusion
  • 9.
    Summary of ourproposed method page 9 Input • Step 1: Enumeration of dense subgraphs • Step 2: Conversion to the intersection graph • Step 3: Calculation of the weights of edges • Step 4: Clustering based on modularity Output Clusters (communities)
  • 10.
    Step 1: Enumerationof dense subgraphs page 10 dense subgraph clique, n-clique, n-clan, etc. complete graph example of clique enumeration (example of dense subgraph) { D, E, I, J } (size = 4) K I H clique threshold: threshold J of size of clique B G threshold enumerate D 3 A E 4 C F 5 { B, C, D, J, K } (size = 5) E, G, H } (size = 3) {
  • 11.
    Step 2: Conversionto the intersection graph page 11 Z K I H E, G, J B H G {E} D A B, C, D D, E, E C F J, K I, J X { D, J } Y (common member) overlap threshold: threshold of number of common members
  • 12.
    Step 3: Calculationof weights of edges page 12 X B C D Y K J E I
  • 13.
    Step 4: Clusteringbased on modularity page 13 automatically detection of best number of clusters
  • 14.
    Summary of ourproposed method page 14 K I H J B • Step 1: Enumeration G of cliques D A E C F • Step 2: Conversion to Z X the intersection graph Y • Step 3: Calculation of Z weights of edges X Y • Step 4: Clustering Z cluster 2 based on modularity cluster 1 X Y
  • 15.
    Overview page 15 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Experimental Results 5. Conclusion
  • 16.
    Dataset page 16 mixi: one of the most popular SNS in Japan • test subjects: 20 mixi users • link structure: two radius from each test subject • content information: self-introduction, friend introduction, attributes (gender, address, birthday, etc.) ground truth: relation names between a test subject and person in the dataset which is enumerated by the test subject (e.g., „same university‟, „hobby friend‟, „coworker‟) evaluation: we evaluate communities that include test subject • numerical evaluation • precision • recall • F-measure • visual evaluation
  • 17.
    Implementation page 17 parameter setting: (clique threshold, overlap threshold) = (3, 2), (4, 2), (4, 3), (5, 2), (5, 3) and (5, 4) implementation: conventional method WithCA NonCA Everett‟s Everett‟s method method* content friend introduction analysis none analysis clustering simple hierarchical clustering based on modularity method clustering # output equals to automatically determined correct data* clusters NonCA * the number of relation names which is enumerated by test subject
  • 18.
    Overview page 18 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Results and Discussions 5. Conclusion
  • 19.
    Numerical evaluation (F-measure) page 19 Everett's method Everett's method* NonCA WithCA 0.5 0.4 F-measure 0.3 0.2 0.1 0 (3, 2) (4, 2) (4, 3) (5, 2) (5, 3) (5, 4) (clique threshold, overlap threshold) Everett Everett* NonCA WithCA Everett bad bad bad superiority of our method Everett* good bad bad NonCA good good contribution of clustering WithCA good good based on modularity
  • 20.
    Numerical evaluation (precision) page 20 Everett's method Everett's method* NonCA WithCA 0.8 0.7 0.6 precision 0.5 0.4 0.3 0.2 0.1 0 (3, 2) (4, 2) (4, 3) (5, 2) (5, 3) (5, 4) (clique threshold, overlap threshold) Everett Everett* NonCA WithCA Everett bad Everett* bad NonCA bad the precision become WithCA good good good better with content analysis
  • 21.
    Numerical evaluation (recall) page 21 Everett's method Everett's method* NonCA WithCA 0.6 0.5 0.4 recall 0.3 0.2 0.1 0 (3, 2) (4, 2) (4, 3) (5, 2) (5, 3) (5, 4) (clique threshold, overlap threshold) Everett Everett* NonCA WithCA our methods overcome Everett bad bad bad the ordinary method Everett* good bad bad the recall become better by NonCA good good good using clustering method WithCA good good bad based on modularity
  • 22.
    Visual evaluation (Everett’smethod vs. WithCA) page 22 Everett‟s method WithCA
  • 23.
    Visual evaluation (NonCAvs. WithCA) page 23 NonCA WithCA
  • 24.
    Overview page 24 1. Background and Problems of Community Detection 2. Our Proposed Method 3. Experimentation in Real SNS Networks 4. Results and Discussions 5. Conclusion
  • 25.
    Conclusion page 25 • Features of our proposed method – Our method can allocate nodes several clusters – Our method can represent edge inhomogeneity – Our method can automatically detect the number of clusters • Evaluation on real SNS networks – Our method overcomes conventional method in F- measure – The recall becomes better by using clustering method based on modularity – The precision becomes better with content analysis