Restoring the original data from a cross table - 井出草平の研究ノート

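I had occasion to reanalyze a cross table and needed to convert the table back into the original data, so I looked into how to do it.
"Original data" is, of course, not really the right term.
I searched for the proper name but could not settle on one. What I mean is the data layout in which each row is a case and each column is a variable. For now, I will tentatively call it case data*1.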

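As a memo, here are sample data and code for converting among three formats: cross tables, frequency tables of combinations (also called the stacked format), and case data.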

Converting a cross table to case data

R: Convert contingency table to long data.frame https://stackoverflow.com/questions/48330888/r-convert-contingency-table-to-long-data-frame

Create a sample cross table.

kdat <- data.frame(positive = c(8, 4), 
                   negative = c(3, 6),
                   row.names = c("positive", "negative"))

The cross table looks like this.

         positive negative
positive        8        3
negative        4        6

The conversion uses the pipe operator.

library(tidyverse)
kdat2<- kdat %>%
      rownames_to_column() %>%            # set row names as a variable
      gather(rowname2,value,-rowname) %>% # reshape
      rowwise() %>%                       # for every row
      mutate(value = list(1:value)) %>%   # create a series of numbers based on the value
      unnest(value) %>%                   # unnest the counter
      select(-value)                      # remove the counts
kdat2

Check that the conversion worked.

# # A tibble: 21 x 2
#    rowname  rowname2
#      <chr>    <chr>   
# 1 positive positive
# 2 positive positive
# 3 positive positive
# 4 positive positive
# 5 positive positive
# 6 positive positive
# 7 positive positive
# 8 positive positive
# 9 negative positive
# 10 negative positive
# # ... with 11 more rows

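When I actually used this code in a secondary analysis, it created extra cases, so the code may not be very polished. In the bug I ran into, the data were usable once the extra cases were deleted, so the problem was not all that serious.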
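One plausible source of extra cases, not confirmed against the data from that analysis, is a cell with a count of 0: in R, `1:0` evaluates to `c(1, 0)`, so `list(1:value)` turns a zero cell into two rows instead of none. The sketch below uses a hypothetical table `kzero` (not from the original analysis) and swaps in `tidyr::uncount()`, which repeats each row by its count and drops rows whose count is 0. It assumes R 4.1 or later for the native pipe `|>`.

```r
library(tidyr)
library(tibble)

# Hypothetical cross table with one zero cell (illustration only)
kzero <- data.frame(positive = c(8, 0),
                    negative = c(3, 6),
                    row.names = c("positive", "negative"))

# Why 1:value misbehaves for a zero count: 1:0 counts down from 1 to 0
length(1:0)        # 2 elements, not 0
length(seq_len(0)) # 0 elements

kcases <- kzero |>
  rownames_to_column() |>                          # row names become a column
  pivot_longer(-rowname,
               names_to = "rowname2",
               values_to = "value") |>             # one row per cell
  uncount(value)                                   # repeat each row `value` times

nrow(kcases)  # should equal sum(kzero) = 17
```

Within the original pipeline, replacing `1:value` with `seq_len(value)` should have the same effect, since `unnest()` drops rows whose list element is empty by default.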

Converting the stacked format to case data

Converting between data frames and contingency tables(Cookbook for R) http://www.cookbook-r.com/Manipulating_data/Converting_between_data_frames_and_contingency_tables/#countstocases-function

Create sample data.

dcounts <- data.frame(
    farc=c("absent", "present", "absent", "present","absent", "present","absent", "present", "absent", "present","absent", "present"), 
    pa=c("inside","inside","inside","inside", "5 km buffer", "5 km buffer","5 km buffer", "5 km buffer", "outside", "outside", "outside", "outside"),
    year=c("2017", "2017", "2018", "2018","2017", "2017", "2018", "2018","2017", "2017", "2018", "2018"), 
    Freq=c(263, 136, 241, 870, 218, 157,378,389,8096,877,15070,1602) )

The data look like this.

      farc          pa year  Freq
1   absent      inside 2017   263
2  present      inside 2017   136
3   absent      inside 2018   241
4  present      inside 2018   870
5   absent 5 km buffer 2017   218
6  present 5 km buffer 2017   157
7   absent 5 km buffer 2018   378
8  present 5 km buffer 2018   389
9   absent     outside 2017  8096
10 present     outside 2017   877
11  absent     outside 2018 15070
12 present     outside 2018  1602

The approach is to define a function first and then pass the data to it.

# Convert from data frame of counts to data frame of cases.
# `countcol` is the name of the column containing the counts
countsToCases <- function(x, countcol = "Freq") {
    # Get the row indices to pull from x
    idx <- rep.int(seq_len(nrow(x)), x[[countcol]])

    # Drop count column
    x[[countcol]] <- NULL

    # Get the rows from x
    x[idx, ]
}
dcases<-countsToCases(dcounts)
dcases

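Check that the result is case data. The Freq column is dropped and each case gets its own row, 28,297 rows in total (the sum of Freq). The first six rows:

      farc     pa year
1   absent inside 2017
1.1 absent inside 2017
1.2 absent inside 2017
1.3 absent inside 2017
1.4 absent inside 2017
1.5 absent inside 2017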

I have only run this code on the sample, so I have not checked whether it produces any bugs. At a glance it seems fine.
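One way to test the function beyond eyeballing it is a round trip: expand the counts to cases, tabulate the cases again with base R's `table()`, and compare with the original Freq column. The following is a minimal sketch of that check. The names `recount` and `check` are introduced here for illustration, and the sample data are rebuilt with `rep()` to keep the block short.

```r
# Same counts as the sample above, built with rep()
dcounts <- data.frame(
  farc = rep(c("absent", "present"), times = 6),
  pa   = rep(c("inside", "5 km buffer", "outside"), each = 4),
  year = rep(rep(c("2017", "2018"), each = 2), times = 3),
  Freq = c(263, 136, 241, 870, 218, 157, 378, 389, 8096, 877, 15070, 1602)
)

# Function from Cookbook for R, as quoted above
countsToCases <- function(x, countcol = "Freq") {
  idx <- rep.int(seq_len(nrow(x)), x[[countcol]])  # row i repeated Freq[i] times
  x[[countcol]] <- NULL                            # drop the count column
  x[idx, ]
}

dcases <- countsToCases(dcounts)

# Tabulate the cases again and line the counts up with the originals
recount <- as.data.frame(table(dcases), stringsAsFactors = FALSE)
check <- merge(dcounts, recount,
               by = c("farc", "pa", "year"),
               suffixes = c(".orig", ".back"))

nrow(dcases) == sum(dcounts$Freq)         # one row per case
all(check$Freq.orig == check$Freq.back)   # counts survive the round trip
```

If both comparisons return TRUE, no cases were lost or added for this input. That does not rule out problems with other inputs, such as missing or non-integer counts.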

Converting a cross table to the stacked format

The reshape2 package can be used for this.
The example uses the cross table kdat from above.

library(reshape2)
kdat3<- melt(cbind(rownames(kdat), kdat))
kdat3

The result is as follows.

  rownames(kdat) variable value
1       positive positive     8
2       negative positive     4
3       positive negative     3
4       negative negative     6
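For the opposite direction, base R's `xtabs()` can rebuild a cross table from stacked counts, which also serves as a check on the result above. This is a sketch rather than part of the original workflow: the stacked data are typed in by hand so that the block does not depend on reshape2, and the column name `row_label` is made up here in place of `rownames(kdat)`.

```r
# Stacked counts, matching the melt() result above
kstack <- data.frame(
  row_label = c("positive", "negative", "positive", "negative"),
  variable  = c("positive", "positive", "negative", "negative"),
  value     = c(8, 4, 3, 6)
)

# Sum `value` within each combination of row_label and variable
ktab <- xtabs(value ~ row_label + variable, data = kstack)
ktab
```

Note that `xtabs()` orders the categories alphabetically, so "negative" comes before "positive" in the rebuilt table. The cell values are the same as in kdat.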

The syntax would be dplyr-style, but pivot_longer() in the tidyr package, the successor to reshape2, should be able to do the same thing (I have not tried it).
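A minimal sketch of what the pivot_longer() version might look like, assuming tidyr and tibble are installed and R is 4.1 or later for the native pipe. The column names `row_var` and `col_var` and the object name `kdat4` are chosen here for illustration.

```r
library(tidyr)
library(tibble)

# Same sample cross table as above
kdat <- data.frame(positive = c(8, 4),
                   negative = c(3, 6),
                   row.names = c("positive", "negative"))

kdat4 <- kdat |>
  rownames_to_column("row_var") |>        # keep the row labels as a column
  pivot_longer(cols = -row_var,           # stack every column except row_var
               names_to = "col_var",
               values_to = "value")
kdat4
```

The result has the same four combinations and counts as the melt() output, but the rows are ordered by the rows of the original table rather than by its columns.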

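*1: I found some sources that call this the long format, but "long format" is a name that comes up in the wide/long distinction for panel surveys, so I could not tell whether "long" is the appropriate term here.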