{parallelDist}が鬼速い

R Advent Calendar 2019 15日目の記事です。

qiita.com

タイトルだけで内容の99％は伝わっていると思うのですが、{parallelDist} ライブラリが並列でベクトル間の距離計算をしてくれて、鬼のように速いという話をします。

github.com

~~締め切り前なのに距離計算が間に合わない！という時に重宝します。~~

Dynamic Time Warpingとは

今回はDynamic Time Warpingという距離を例にparallelDistの威力を紹介します。
時系列シーケンス間の距離。時系列クラスタリングによく用いられる。
- 平行移動や長さ・周期の違いに対してロバストな類似度を計算できる。
- 先人による解説：

sinhrks.hatenablog.com

一方で計算量がとても大きい。
- 2つのシーケンスを総当りで計算するので、長さnの系列と長さmの系列のDTW距離を計算しようとするとO(nm)かかる。

セッション情報

WSL上で動かしています。

sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.2 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

パッケージインストール

install.packages("TSclust")
install.packages("dtw")
install.packages("parallelDist")

ライブラリ読み込み

library(magrittr)
library(dplyr)
library(tidyr)
library(purrr)

ランダムウォークによる系列の生成

random_walk <- function(x,t){
  return(cumsum(rnorm(t)))
}

set.seed(1024)
series_data <- map_dfc(1:100,random_walk,t=100) #この記法は重宝しますね

plot_data <- series_data %>% 
  mutate(period = row_number()) %>%
  gather(key = series, value = value,-period)
g <- ggplot(plot_data,aes(x = period, y = value, group = series))
g <- g + geom_line(stat="identity")
g

f:id:meana0:20191215014654p:plain

{TSclust}を使う

DTWを計算する時に一番多く使われているのはこのパッケージかと思います。
- 先人による解説：

sinhrks.hatenablog.com

ドキュメンテーション：https://cran.r-project.org/web/packages/TSclust/index.html

tsclust_time <- system.time({
  tsclust_dist <- TSclust::diss(series_data,"DTWARP")
})
tsclust_time["user.self"]

user.self 
   19.609

{dtw}を使う

距離の正規化の手法などが充実した、柔軟なライブラリ
ドキュメンテーション：http://dtw.r-forge.r-project.org/

dtw_time <- system.time({
  dtw_dist <- dtw::dtwDist(t(series_data)) #dtwの場合は時系列を行として配置
})
dtw_time["user.self"]

user.self 
   22.141

parallelDistを使う

とにかく鬼速く距離計算してくれるライブラリ
何故速いのか？
- RcppParallel によるマルチスレッド計算
- RcppArmadillo による行列計算の最適化
- CRTPによる仮想関数の回避(よくわからない…)
GitHub : https://github.com/alexeckert/parallelDist

pardist_time <- system.time({
  pardist_dist <- parallelDist::parDist(t(series_data),
                                        method = "dtw",
                                        step.pattern="symmetric2") #dtwのデフォルトと合わせる
})
pardist_time["user.self"]

user.self 
    2.922

お、鬼速い…((((；ﾟДﾟ))))
約7倍の速度向上。

結果は同じ？

どのライブラリでも結果は同じなのか恐る恐る確認してみる。

as.matrix(tsclust_dist)[1:10]
as.matrix(dtw_dist)[1:10]
as.matrix(pardist_dist)[1:10]

 [1]    0.0000  399.4580  207.6915  150.0798  733.7981 2201.4722  345.4040
 [8] 1569.0761 1201.3074  415.9536
 [1]    0.0000  399.4580  207.6915  150.0798  733.7981 2201.4722  345.4040
 [8] 1569.0761 1201.3074  415.9536
 [1]    0.0000  399.4580  207.6915  150.0798  733.7981 2201.4722  345.4040
 [8] 1569.0761 1201.3074  415.9536

階層的クラスタリング

せっかくなので階層的クラスタリングも高速な {fastcluster}パッケージを使う。

hc <- fastcluster::hclust(pardist_dist)
plot(hc)
trees <- cutree(hc,3)

f:id:meana0:20191215015024p:plain

クラスタリング結果の可視化

cluster_data <- tibble(series = names(trees),
                       cluster = as.factor(trees))
plot_data <- series_data %>% 
  mutate(period = row_number()) %>%
  gather(key = series, value = value,-period) %>%
  left_join(cluster_data,by="series")
g <- ggplot(plot_data,aes(x = period, y = value, group = series, col = cluster))
g <- g + geom_line(stat="identity")
g

f:id:meana0:20191215015054p:plain

時点数Tと系列数Nと、{dtw}比の計算時間の関係を検証したかったのですが、時間とマシンスペックが足りず…。
- 暇な時にやろうかと思います。

補足

{TSclust}が依存している{rgl}を入れる時にunable to open X11 displayと怒られたが、Sys.setenv("RGL_USE_NULL"=TRUE)で解決した。

SNAGeek

Sociologically Technological, and Technologically Sociological