Sequential Pattern Mining in R

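Frequent pattern mining refers to a family of techniques for extracting combinations of items that occur frequently. A representative example is association rule mining, which, applied to POS data, can yield insights such as the observation that many consumers buy beer and diapers together.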

Frequent pattern mining, however, cannot extract patterns that involve order. For example, it cannot tell you that consumers who buy beer tend to buy diapers afterwards.

Techniques that extract such ordered patterns are called sequential pattern mining. The approach was proposed in 1995 by R. Agrawal and R. Srikant of IBM Research.
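To make the difference concrete, here is a minimal sketch in base R. The purchase histories are hypothetical toy data, and the two helper functions are illustrative names, not part of any package. It compares the share of customers who bought both items at all with the share who bought beer before diapers.

```r
# Toy purchase histories (hypothetical data): one character vector per
# customer, listing purchased items in time order.
histories <- list(
  c1 = c("beer", "diapers"),
  c2 = c("diapers", "beer"),
  c3 = c("beer", "milk", "diapers"),
  c4 = c("milk")
)

# TRUE if both items appear anywhere in the history (order ignored).
contains_both <- function(h, a, b) a %in% h && b %in% h

# TRUE if some occurrence of `a` comes before some occurrence of `b`.
contains_in_order <- function(h, a, b) {
  pos_a <- which(h == a)
  pos_b <- which(h == b)
  length(pos_a) > 0 && length(pos_b) > 0 && min(pos_a) < max(pos_b)
}

# Share of customers matching each kind of pattern.
unordered_support <- mean(sapply(histories, contains_both,
                                 a = "beer", b = "diapers"))
ordered_support   <- mean(sapply(histories, contains_in_order,
                                 a = "beer", b = "diapers"))

print(c(unordered = unordered_support, ordered = ordered_support))
```

Customer c2 bought diapers first and beer second, so c2 counts toward the unordered pattern but not toward the ordered one. That gap is exactly what sequential pattern mining is meant to capture.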

The arulesSequences package for R lets you run sequential pattern mining. Let's leave the theory for later and try the package first. Note that, at the time of writing, arulesSequences reportedly supports only Linux/UNIX.

The data used here is the delicious-sequence dataset. It holds the tags that users assigned when they bookmarked websites on the social bookmarking site delicious.com.

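Download the data and save it as "delicious-sequence.txt". The format is shown below. Each record represents one event, "a user bookmarks a website and tags it", and the columns are, from left to right: user ID, event ID, number of tags, and the list of tags.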

1 1 2 WebService Font
2 1 5 design Webdesign inspiration icons images
3 1 3 ubuntu linux util
4 1 1 Apps
5 1 1 board
6 1 2 cake, cinnamon
7 1 1 education
7 2 1 educators
7 3 1 math
7 4 1 education
...
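As a rough illustration of this record layout, the following base R sketch splits a few of the sample lines into their four parts. The function name parse_event is hypothetical, and the sketch assumes the fields are separated by single spaces; read_baskets does the real parsing when the file is loaded.

```r
# A few records copied from the sample shown in this post.
lines <- c(
  "1 1 2 WebService Font",
  "2 1 5 design Webdesign inspiration icons images",
  "7 2 1 educators"
)

# Split one record into user ID, event ID, tag count, and tags.
# Assumes single-space separators (an assumption about the file).
parse_event <- function(line) {
  fields <- strsplit(line, " ", fixed = TRUE)[[1]]
  list(
    sequenceID = as.integer(fields[1]),  # user ID
    eventID    = as.integer(fields[2]),  # order of the event for that user
    size       = as.integer(fields[3]),  # declared number of tags
    tags       = fields[-(1:3)]          # everything after the third column
  )
}

events <- lapply(lines, parse_event)

# Sanity check: the declared tag count should equal the number of tags.
sizes_ok <- sapply(events, function(e) e$size == length(e$tags))
print(sizes_ok)
```

The user ID plays the role of the sequence ID and the event ID gives the order within that user's sequence, which is why the loading code names these columns sequenceID and eventID.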

Now let's load the data.

> install.packages("arulesSequences")
> library(arulesSequences)
> # Load the data
> ds <- read_baskets("../data/delicious-sequence.txt", info = c("sequenceID","eventID","SIZE"))

You can inspect the structure and a summary of the loaded data as follows.

> ds
transactions in sparse format with
 7559 transactions (rows) and
 7496 items (columns)
> # Inspect the structure of the object
> str(ds)
Formal class 'transactions' [package "arules"] with 4 slots
  ..@ transactionInfo:'data.frame':	7559 obs. of  3 variables:
  .. ..$ sequenceID: int [1:7559] 1 2 3 4 5 6 7 7 7 7 ...
  .. ..$ eventID   : int [1:7559] 1 1 1 1 1 1 1 2 3 4 ...
  .. ..$ SIZE      : Factor w/ 20 levels "1","10","11",..: 12 16 14 1 1 12 1 1 1 1 ...
  ..@ data           :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. ..@ i       : int [1:25401] 553 1435 1439 2800 4020 4062 4142 4540 6947 7019 ...
  .. .. ..@ p       : int [1:7560] 0 2 7 10 11 12 14 15 16 17 ...
  .. .. ..@ Dim     : int [1:2] 7496 7559
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ factors : list()
  ..@ itemInfo       :'data.frame':	7496 obs. of  1 variable:
  .. ..$ labels:Class 'AsIs'  chr [1:7496] "&#1608;&#1576;" "&#51089;&#44032;" "&#44396;&#44544;" "&#53944;&#50948;&#53552;" ...
  ..@ itemsetInfo    :'data.frame':	7559 obs. of  1 variable:
  .. ..$ itemsetID: Factor w/ 7559 levels "1","10","100",..: 1 1112 2223 3334 4445 5556 6667 7338 7449 2 ...
> # Summary
> summary(ds)
transactions as itemMatrix in sparse format with
 7559 rows (elements/itemsets/transactions) and
 7496 columns (items) and a density of 0.0004482878 

most frequent items:
     design       tools        blog   webdesign inspiration     (Other) 
        469         301         233         229         220       23949 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
2283 1432 1172  825  560  343  230  273  171  100   60   34   25   14    5    5 
  17   18   19   20 
   5    7    8    7 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    1.00    3.00    3.36    4.00   20.00 

includes extended item information - examples:
  labels
1      |
2      -
3      ,

includes extended transaction information - examples:
  sequenceID eventID SIZE
1          1       1    2
2          2       1    5
3          3       1    3

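Here is a brief explanation of the summary above.

Most frequent items: in descending order of occurrence, "design" (469), "tools" (301), "blog" (233), "webdesign" (229), and "inspiration" (220).
Number of tags per transaction: 1 tag (2283 transactions), 2 tags (1432), 3 tags (1172), 4 tags (825), 5 tags (560), and so on up to 20 tags.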
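The density and the mean in the summary can be reproduced from the other numbers it prints. The sketch below only re-does that arithmetic in base R with values copied from the output above; the variable names are illustrative.

```r
# Values copied from the str() and summary() output above.
n_transactions <- 7559
n_items        <- 7496
n_nonzero      <- 25401   # length of slot `i` in the sparse matrix

# Length distribution: number of transactions with 1, 2, ..., 20 tags.
sizes  <- 1:20
counts <- c(2283, 1432, 1172, 825, 560, 343, 230, 273, 171, 100,
            60, 34, 25, 14, 5, 5, 5, 7, 8, 7)

# The total number of tag assignments equals the number of non-zero cells.
total_tags <- sum(sizes * counts)

# Density: share of the transactions x items matrix that is non-zero.
density <- n_nonzero / (n_transactions * n_items)

# Mean number of tags per transaction.
mean_tags <- n_nonzero / n_transactions

print(c(total_tags = total_tags, density = density, mean_tags = mean_tags))
```

In other words, the matrix is extremely sparse: each transaction uses only about 3.4 of the 7496 distinct tags on average.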

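Next, let's run sequential pattern mining on this transaction data. In arulesSequences, the cspade function provides the SPADE algorithm, one of the methods for sequential pattern mining. Here the minimum support is set to 0.001.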

> ds.sp <- cspade(ds, parameter=list(support=0.001), control=list(verbose=TRUE)) 

parameter specification:
support : 0.001
maxsize :    10
maxlen  :    10

algorithmic control:
bfstype : FALSE
verbose :  TRUE
summary : FALSE

preprocessing ... 1 partition(s), 0.33 MB [0.058s]
mining transactions ... 0.06 MB [0.092s]
reading sequences ... [0.48s]

total elapsed time: 0.633s

Convert the result to a data frame and check its first and last rows.

> ds.sp.df <- as(ds.sp, "data.frame")
> head(ds.sp.df)
  sequence     support
1 <{.net}> 0.004038257
2  <{2.0}> 0.001912859
3 <{2009}> 0.002550478
4   <{3D}> 0.001062699
5   <{3d}> 0.004675877
6 <{9/11}> 0.002550478
> tail(ds.sp.df)
                            sequence     support
3091 <{art,inspiration},{art},{art}> 0.001062699
3092      <{art,design},{art},{art}> 0.001275239
3093    <{art},{design},{art},{art}> 0.001062699
3094       <{art},{art},{art},{art}> 0.001062699
3095           <{applications,apps}> 0.001487779
3096             <{design},{agency}> 0.001275239

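From these results we can see, for example, that:

The ratio of users who used the tag ".net" to all users (the support) is 0.004038257. Note that the rows are not sorted by support; among the six rows shown by head(), "3d" has the highest support (0.004675877).
The ratio of users who tagged two different websites with "design" and then "agency", in that order, to all users is 0.001275239. This is above the 0.001 support threshold, but it is not the smallest support in the result: several patterns in the tail() output have a support of 0.001062699.

A more detailed analysis of these patterns should yield further insights.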
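To find which patterns actually have the highest or lowest support, the data frame has to be sorted first. The following is a minimal sketch using base R's order(); it rebuilds only the six rows printed by head() as a stand-in for the real ds.sp.df, so the names res and res_sorted are illustrative.

```r
# Stand-in for ds.sp.df: the six rows shown by head() above.
res <- data.frame(
  sequence = c("<{.net}>", "<{2.0}>", "<{2009}>",
               "<{3D}>", "<{3d}>", "<{9/11}>"),
  support  = c(0.004038257, 0.001912859, 0.002550478,
               0.001062699, 0.004675877, 0.002550478),
  stringsAsFactors = FALSE
)

# Sort by support, largest first.
res_sorted <- res[order(res$support, decreasing = TRUE), ]

# Pattern with the highest and the lowest support among these rows.
top    <- res_sorted$sequence[1]
bottom <- res_sorted$sequence[nrow(res_sorted)]

print(res_sorted)
```

Applying the same order() call to the full ds.sp.df would rank all 3096 patterns; the ranking above only covers the six sample rows.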

In this post we used R's arulesSequences package to run sequential pattern mining. The algorithm itself deserves an explanation, which I plan to write up separately on this blog.

References