mog project: Finding Community Structure in a Big Graph

C++: ビッググラフにおけるコミュニティ抽出のメモ

手法

ネットワークにおけるコミュニティ抽出 (グラフのクラスタリング) にはいくつかの手法がある。

こちらを参照 => R seminar on igraph

Non-overlapping (strict partition)
- 辺媒介性を用いた手法
  [cond-mat/0112110] Community structure in social and biological networks
- 貪欲 (greedy) アルゴリズムを用いた手法
  [cond-mat/0408187] Finding community structure in very large networks
- スペクトル最適化法
  [physics/0602124] Modularity and community structure in networks
- 焼きなまし法 (Simulated annealing)
  [cond-mat/0603718] Statistical Mechanics of Community Detection
Overlapping
- Cique percolation
  [physics/0506133] Uncovering the overlapping community structure of complex networks in nature and society
- Edge partition
  [0903.2181] Line Graphs, Link Partitions and Overlapping Communities
- Link communities
  [0903.3178] Link communities reveal multiscale complexity in networks
- Overlapping cluster generator
  Multifunctional proteins revealed by overlapping clustering in protein interaction network

スパースなグラフに対する Non-overlapping で最も高速なのは、貪欲アルゴリズムを用いた手法 (O(n (log n)^2))。
今回はこちらを採用。

実装

R であれば igraph – Network analysis software に既に実装がある。

Python の NetworkX では過去に議論された形跡が見られるものの、取り込まれる見込みは薄そうだ。

Fast Algorithm for Finding Community Structure (migrated from Trac #245) · Issue #245 · networkx/networkx

GraphX には、もちろん無い。

今回はとにかく消費メモリを節約したかったので、論文作者本人が公開している C++ の実装を使うことにする。

Structure & Strangeness

ビルド

上記の URL で公開されているソースコードをビルドする。(ライセンスが GPL なので注意)

コンパイルオプションなどが今日のコンパイラに追随していなかったので、若干の修正を要した。

Makefile (-fforce-mem オプション削除)

@@ -8,7 +8,7 @@

 # compiler name and flags
 CCC = g++
-CCFLAGS = -O3 -fomit-frame-pointer -funroll-loops -fforce-mem -fforce-addr -fexpensive-optimizations
+CCFLAGS = -O3 -fomit-frame-pointer -funroll-loops -fforce-addr -fexpensive-optimizations

 # loader flags
 LDFLAGS =

fastcommunity_mh.cc (iostream.h -> iostream, string.h 追加)

@@ -71,12 +71,13 @@
 //
 ////////////////////////////////////////////////////////////////////////

-#include <iostream.h>
+#include <iostream>
 #include <fstream>
 #include <string>
 #include "stdlib.h"
 #include "time.h"
 #include "math.h"
+#include "string.h"

 #include "maxheap.h"
 #include "vektor.h"

maxheap.h (iostream.h -> iostream, namespace std 追加)

@@ -40,7 +40,8 @@
 #if !defined(MAXHEAP_INCLUDED)
 #define MAXHEAP_INCLUDED

-#include <iostream.h>
+#include <iostream>
+using namespace std;

 /*   Because using this heap requires us to be able to modify an arbitrary element's
        data in constant O(1) time, I use to tricky tactic of having elements in an array-

vektor.h (iostream.h -> iostream)

@@ -30,7 +30,7 @@
 #if !defined(vektor_INCLUDED)
 #define vektor_INCLUDED

-#include <iostream.h>
+#include <iostream>

 #if !defined(vektor_INCLUDED)
 #define vektor_INCLUDED

その後、ディレクトリ内で make コマンドを行えば、実行ファイル FastCommunityMH が生成される。

実行

まずは有名(?)な空手クラブの派閥争いのデータで動作確認してみる。

$ curl -O http://tuvalu.santafe.edu/~aaronc/hierarchy/karate.pairs
$ ./FastCommunity_GPL_v1.0.3/FastCommunityMH -f ./karate.pairs -l firstRun

以下の 2ファイルが成果物としてカレントディレクトリに出力される。

FASTCOMMUNITY_INFERENCE_ALGORITHM
START-----:     Sun Nov 22 01:54:13 2015
---FILES--------
D_IN------:     ./
D_OUT-----:     ./
F_IN------:     karate.pairs
F_JOINS---:     karate-fc_firstRun.joins
F_INFO----:     karate-fc_firstRun.info
---NET_STATS----
MAXID-----:     34
NUMNODES--:     34
NUMEDGES--:     78
---MODULARITY---
MAXQ------:     0.383136
STEP------:     32
EXIT------:     Sun Nov 22 01:54:13 2015

-1      -1      -0.0498028      0
17      6       -0.0376397      1
6       7       -0.0139711      2
7       1       -0.00147929     3
5       1       0.0177515       4
11      1       0.0490631       5
27      30      0.0612262       6
30      34      0.0784845       7
24      34      0.0946746       8
28      34      0.111111        9
26      25      0.123192        10
25      32      0.145874        11
13      4       0.157709        12
22      2       0.16905 13
31      9       0.180227        14
9       33      0.196992        15
10      3       0.208169        16
18      2       0.219181        17
12      1       0.229372        18
8       4       0.239563        19
4       3       0.253369        20
14      3       0.269149        21
2       3       0.289448        22
29      32      0.29931 23
32      34      0.311144        24
23      33      0.320513        25
19      33      0.329553        26
21      33      0.338264        27
33      34      0.349359        28
16      34      0.362837        29
15      34      0.375986        30
1       20      0.380671        31
20      3       0.383136        32

この *.joins ファイルを追っていけば、Q値の推移、検出されたクラスタの状態を把握することができる。

-c オプション(特定ステップ時点の状態をファイルに出力)や -v --v オプション(詳細表示)も約に立つが、その度に全データの再計算が行われてしまうので、ビッググラフでは扱いづらい。

ノード数が増えたら、*.joins ファイルを解析するためのツールを作る必要があるだろう。

ビッググラフへの適用

300万ノード、1300万エッジのグラフに対して、上記のプログラムを実行してみた。

期待どおりメモリ使用量は 10GB 以内とかなり少なく抑えられている。

しかし、並列処理には向いていないアルゴリズムなので、それなりの実行時間はかかりそう。
おそらく、完了までは10日程度を要するであろう。

mog project

11.22.2015

Finding Community Structure in a Big Graph

手法

実装

ビルド

実行

ビッググラフへの適用

0 件のコメント:

コメントを投稿