Everyday Idle Thoughts

The monologue of a man who has reached forty and cannot hide his bewilderment

Notes on datasets

R attaches the datasets package by default.
The function data() lists the datasets it provides.

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)
locale:
 [1] LC_CTYPE=ja_JP.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=ja_JP.UTF-8        LC_COLLATE=ja_JP.UTF-8    
 [5] LC_MONETARY=ja_JP.UTF-8    LC_MESSAGES=ja_JP.UTF-8   
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=ja_JP.UTF-8 LC_IDENTIFICATION=C       
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     
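
As a small sketch of the point above, the listing returned by data() can be searched rather than read by eye. The names ds and hits are mine, and restricting the listing to package = "datasets" is an assumption made only to keep the result short.

```r
# Check that the datasets package is on the search path.
print("package:datasets" %in% search())

# data() returns an object whose $results component is a character matrix
# with the columns Package, LibPath, Item and Title.
ds <- as.data.frame(data(package = "datasets")$results,
                    stringsAsFactors = FALSE)

# Keep only the datasets whose title mentions "abortion" (case-insensitive).
hits <- ds[grepl("abortion", ds$Title, ignore.case = TRUE), c("Item", "Title")]
print(hits)
```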

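I had been meaning to write up an overview, but it turns out that summaries already exist here and here.
Thank you, id:hoxo_m. They are nicer than anything I would have put together, so I will gladly make use of them.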

  • For example, take data(infert).

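This dataset concerns infertility after spontaneous and induced abortion. It is a data frame with eight columns: education, age, parity (number of births), induced (number of prior induced abortions), case (case status), spontaneous (number of prior spontaneous abortions), stratum (matched set number), and pooled.stratum (stratum number).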

> data() # list the available datasets; output omitted
> 
> # show only the row for infert
> subset(data.frame(data()$results), subset=Item=="infert")
    Package            LibPath   Item
61 datasets /usr/lib/R/library infert
                                                Title
61 Infertility after Spontaneous and Induced Abortion
> 
> is(infert)
[1] "data.frame" "list"       "oldClass"   "vector"    
> str(infert)
'data.frame':	248 obs. of  8 variables:
 $ education     : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
 $ age           : num  26 42 39 34 35 36 23 32 21 28 ...
 $ parity        : num  6 1 6 4 3 4 1 2 1 2 ...
 $ induced       : num  1 1 2 2 1 2 0 0 0 0 ...
 $ case          : num  1 1 1 1 1 1 1 1 1 1 ...
 $ spontaneous   : num  2 0 0 0 1 1 0 0 1 0 ...
 $ stratum       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ pooled.stratum: num  3 1 4 2 32 36 6 22 5 19 ...
> 
> summary(infert) # example(infert) runs the usage examples
   education        age            parity         induced      
 0-5yrs : 12   Min.   :21.00   Min.   :1.000   Min.   :0.0000  
 6-11yrs:120   1st Qu.:28.00   1st Qu.:1.000   1st Qu.:0.0000  
 12+ yrs:116   Median :31.00   Median :2.000   Median :0.0000  
               Mean   :31.50   Mean   :2.093   Mean   :0.5726  
               3rd Qu.:35.25   3rd Qu.:3.000   3rd Qu.:1.0000  
               Max.   :44.00   Max.   :6.000   Max.   :2.0000  
      case         spontaneous        stratum      pooled.stratum 
 Min.   :0.0000   Min.   :0.0000   Min.   : 1.00   Min.   : 1.00  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:21.00   1st Qu.:19.00  
 Median :0.0000   Median :0.0000   Median :42.00   Median :36.00  
 Mean   :0.3347   Mean   :0.5766   Mean   :41.87   Mean   :33.58  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:62.25   3rd Qu.:48.25  
 Max.   :1.0000   Max.   :2.0000   Max.   :83.00   Max.   :63.00  

Of course, help(infert) gives a more detailed description.

> help(infert)

infert                package:datasets                 R Documentation
Infertility after Spontaneous and Induced Abortion
Description:
     This is a matched case-control study dating from before the
     availability of conditional logistic regression.
Usage:
     infert
Format:
       1.  Education              0 = 0-5  years       
                                  1 = 6-11 years       
                                  2 = 12+  years       
       2.  age                    age in years of case 
       3.  parity                 count                
       4.  number of prior        0 = 0                
           induced abortions      1 = 1                
                                  2 = 2 or more        
       5.  case status            1 = case             
                                  0 = control          
       6.  number of prior        0 = 0                
           spontaneous abortions  1 = 1                
                                  2 = 2 or more        
       7.  matched set number     1-83                 
       8.  stratum number         1-63                 
Note:
     One case with two prior spontaneous abortions and two prior
     induced abortions is omitted.
Source:
     Trichopoulos et al. (1976) _Br. J. of Obst. and Gynaec._ *83*,
     645-650.
Examples:
     require(stats)
     model1 <- glm(case ~ spontaneous+induced, data=infert,family=binomial())
     summary(model1)
     ## adjusted for other potential confounders:
     summary(model2 <- glm(case ~ age+parity+education+spontaneous+induced,
                     data=infert,family=binomial()))
     ## Really should be analysed by conditional logistic regression
     ## which is in the survival package
     if(require(survival)){
       model3 <- clogit(case~spontaneous+induced+strata(stratum),data=infert)
       print(summary(model3))
       detach()# survival (conflicts)
     }

The datasets also ship with examples, so example(infert) is a handy way to see one way of using the data.
Still, I would have thought induced abortions was number 5…
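
One way to check the numbering in the Format section against the data itself is to look up column positions by name. This is only a sketch, and the name pos is mine. Going by the str(infert) output shown earlier, induced sits in position 4 and case in position 5.

```r
# Position of each column in the data frame, looked up by name.
cols <- c("induced", "case", "spontaneous")
pos <- match(cols, names(infert))
names(pos) <- cols
print(pos)
```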

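Now, suppose we want contingency tables showing how age relates to parity, and how age relates to the number of induced abortions, in infert. Perhaps something like this? (The two tables below are limited to women with at least one prior induced abortion, via induced!="0".)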

> x <- subset(infert,select=c(age, parity, induced))
> x$age <- factor(x$age<35,levels=c(TRUE,FALSE),labels=c("<35",">=35"))
> x$parity <- factor(x$parity==1,levels=c(TRUE,FALSE),labels=c("1",">1"))
> x$induced <- factor(x$induced,levels=c(0,1,2),labels=c("0","1","more"))
> 
> table(subset(x,subset=induced!="0",select=c(age, parity)))
      parity
age     1 >1
  <35  17 59
  >=35  6 23
> table(subset(x,subset=induced!="0",select=c(age, induced)))
      induced
age     0  1 more
  <35   0 50   26
  >=35  0 18   11

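The same two tables can also be sketched without recoding a copy of the data first, by building the factors inline. This is an alternative formulation, not the approach used in the transcript: the names age_grp, parity_grp, induced_grp and has_induced are mine, and droplevels() is used so that the empty 0 column does not appear.

```r
# Recode the three variables as factors, mirroring the cut points used above.
age_grp     <- factor(infert$age < 35, levels = c(TRUE, FALSE),
                      labels = c("<35", ">=35"))
parity_grp  <- factor(infert$parity == 1, levels = c(TRUE, FALSE),
                      labels = c("1", ">1"))
induced_grp <- factor(infert$induced, levels = c(0, 1, 2),
                      labels = c("0", "1", "more"))

# Restrict to women with at least one prior induced abortion.
has_induced <- infert$induced > 0

# Age by parity, and age by induced abortions with the unused level dropped.
tab_parity  <- table(age = age_grp[has_induced],
                     parity = parity_grp[has_induced])
tab_induced <- table(age = age_grp[has_induced],
                     induced = droplevels(induced_grp[has_induced]))
print(tab_parity)
print(tab_induced)
```
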
And to look at parity against the number of induced abortions, stratified by age at 35, perhaps something like this?

> table(subset(x,subset=age=="<35",select=c(parity, induced)))
      induced
parity  0  1 more
    1  52 17    0
    >1 40 33   26
> table(subset(x,subset=age!="<35",select=c(parity, induced)))
      induced
parity  0  1 more
    1  24  6    0
    >1 27 12   11

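Both strata can also be produced in one call by adding age as a third dimension. The sketch below repeats the recoding so that it runs on its own. The names x2 and tab3 are mine.

```r
# Recode a copy of the three columns, as in the transcript above.
x2 <- data.frame(
  age     = factor(infert$age < 35, levels = c(TRUE, FALSE),
                   labels = c("<35", ">=35")),
  parity  = factor(infert$parity == 1, levels = c(TRUE, FALSE),
                   labels = c("1", ">1")),
  induced = factor(infert$induced, levels = c(0, 1, 2),
                   labels = c("0", "1", "more"))
)

# Three-way table: parity by induced, one slice per age group.
tab3 <- xtabs(~ parity + induced + age, data = x2)

# ftable() flattens the three-way table into a single printed layout.
print(ftable(tab3, row.vars = c("age", "parity")))
```
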
In this dataset, we can see that no woman with a parity of 1 has had two or more induced abortions.
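
That observation can be checked directly against the raw columns. A minimal sketch, with n_first_two as a name of my own choosing:

```r
# Count women with parity 1 whose induced-abortion code is 2 ("2 or more").
n_first_two <- sum(infert$parity == 1 & infert$induced == 2)
print(n_first_two)

# For comparison, the same count among women with parity greater than 1.
n_later_two <- sum(infert$parity > 1 & infert$induced == 2)
print(n_later_two)
```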