R attaches the datasets package by default.
You can list the datasets it provides with the function data().
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: i686-pc-linux-gnu (32-bit)

locale:
 [1] LC_CTYPE=ja_JP.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=ja_JP.UTF-8        LC_COLLATE=ja_JP.UTF-8
 [5] LC_MONETARY=ja_JP.UTF-8    LC_MESSAGES=ja_JP.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=ja_JP.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
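As a rough sketch, the same check and listing can be done programmatically instead of reading the console output by eye. The variable name items is only illustrative; data(package = "datasets")$results is the character matrix that data() prints from, with the columns Package, LibPath, Item and Title.

```r
# Is the datasets package attached in this session?
"package:datasets" %in% search()

# Turn the listing that data() would display into a data frame.
# `items` is an illustrative name, not something from the original post.
items <- as.data.frame(data(package = "datasets")$results,
                       stringsAsFactors = FALSE)

# Peek at the first few dataset names and their one-line titles.
head(items[, c("Item", "Title")])

# Number of entries listed for the package (varies between R versions).
nrow(items)
```

Having the listing as a data frame makes it easy to filter by name or search the titles.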
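I had been meaning to write up a summary of them, but it turns out summaries already exist here and here.
Thank you, id:hoxo_m. They are nicer than anything I would have put together, so I will gladly make use of them.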
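- For example, take data(infert).
This is data on infertility after spontaneous and induced abortion. It is a data frame made up of education level (education), age (age), number of births (parity), number of induced abortions (induced), case status (case), number of prior spontaneous abortions (spontaneous), matched set number (stratum), and stratum number (pooled.stratum).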
> data() # list the available datasets; output omitted
>
> # show only the entry for infert
> subset(data.frame(data()$results), subset=Item=="infert")
    Package            LibPath   Item
61 datasets /usr/lib/R/library infert
                                                Title
61 Infertility after Spontaneous and Induced Abortion
>
> is(infert)
[1] "data.frame" "list"       "oldClass"   "vector"
> str(infert)
'data.frame':	248 obs. of  8 variables:
 $ education     : Factor w/ 3 levels "0-5yrs","6-11yrs",..: 1 1 1 1 2 2 2 2 2 2 ...
 $ age           : num  26 42 39 34 35 36 23 32 21 28 ...
 $ parity        : num  6 1 6 4 3 4 1 2 1 2 ...
 $ induced       : num  1 1 2 2 1 2 0 0 0 0 ...
 $ case          : num  1 1 1 1 1 1 1 1 1 1 ...
 $ spontaneous   : num  2 0 0 0 1 1 0 0 1 0 ...
 $ stratum       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ pooled.stratum: num  3 1 4 2 32 36 6 22 5 19 ...
>
> summary(infert) # example(infert) shows usage examples
   education        age            parity         induced
 0-5yrs : 12   Min.   :21.00   Min.   :1.000   Min.   :0.0000
 6-11yrs:120   1st Qu.:28.00   1st Qu.:1.000   1st Qu.:0.0000
 12+ yrs:116   Median :31.00   Median :2.000   Median :0.0000
               Mean   :31.50   Mean   :2.093   Mean   :0.5726
               3rd Qu.:35.25   3rd Qu.:3.000   3rd Qu.:1.0000
               Max.   :44.00   Max.   :6.000   Max.   :2.0000
      case         spontaneous        stratum      pooled.stratum
 Min.   :0.0000   Min.   :0.0000   Min.   : 1.00   Min.   : 1.00
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:21.00   1st Qu.:19.00
 Median :0.0000   Median :0.0000   Median :42.00   Median :36.00
 Mean   :0.3347   Mean   :0.5766   Mean   :41.87   Mean   :33.58
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:62.25   3rd Qu.:48.25
 Max.   :1.0000   Max.   :2.0000   Max.   :83.00   Max.   :63.00
Of course, help(infert) gives a more detailed description.
> help(infert)
infert                package:datasets                 R Documentation

Infertility after Spontaneous and Induced Abortion

Description:

     This is a matched case-control study dating from before the
     availability of conditional logistic regression.

Usage:

     infert

Format:

       1.  Education                   0 = 0-5  years
                                       1 = 6-11 years
                                       2 = 12+  years
       2.  age                         age in years of case
       3.  parity                      count
       4.  number of prior             0 = 0
           induced abortions           1 = 1
                                       2 = 2 or more
       5.  case status                 1 = case
                                       0 = control
       6.  number of prior             0 = 0
           spontaneous abortions       1 = 1
                                       2 = 2 or more
       7.  matched set number          1-83
       8.  stratum number              1-63

Note:

     One case with two prior spontaneous abortions and two prior
     induced abortions is omitted.

Source:

     Trichopoulos et al. (1976) _Br. J. of Obst. and Gynaec._ *83*,
     645-650.

Examples:

     require(stats)
     model1 <- glm(case ~ spontaneous+induced, data=infert,family=binomial())
     summary(model1)
     ## adjusted for other potential confounders:
     summary(model2 <- glm(case ~ age+parity+education+spontaneous+induced,
                     data=infert,family=binomial()))
     ## Really should be analysed by conditional logistic regression
     ## which is in the survival package
     if(require(survival)){
       model3 <- clogit(case~spontaneous+induced+strata(stratum),data=infert)
       print(summary(model3))
       detach()# survival (conflicts)
     }
The datasets also come with examples, so example(infert) is a handy way to see one possible use of the data.
That said, it looks to me as if induced abortions ought to be number 5...
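example(infert) runs the Examples section of the help page in one go. If you would rather step through it, the first model there can be fit by hand, roughly like this. This is only a sketch of that one step, not a recommended analysis; the help page itself notes that the data really call for conditional logistic regression.

```r
# First model from the Examples section of help(infert), fit by hand:
# logistic regression of case status on the two abortion counts.
model1 <- glm(case ~ spontaneous + induced,
              data = infert, family = binomial())

# Coefficient table: estimate, standard error, z value and p-value.
coef(summary(model1))
```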
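Now, suppose you want to look at, say, age against parity and age against the number of induced abortions as contingency tables. Would it be something like this? (Both tables below are restricted to women with at least one induced abortion.)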
> x <- subset(infert,select=c(age, parity, induced))
> x$"age" <- factor(x$"age"<35,levels=c(T,F),labels=c("<35",">=35"))
> x$"parity" <- factor(x$"parity"==1,levels=c(T,F),labels=c("1",">1"))
> x$"induced" <- factor(x$"induced",levels=c(0,1,2),labels=c("0","1","more"))
>
> table(subset(x,subset=induced!="0",select=c(age, parity)))
      parity
age     1 >1
  <35  17 59
  >=35  6 23
> table(subset(x,subset=induced!="0",select=c(age, induced)))
      induced
age     0  1 more
  <35   0 50   26
  >=35  0 18   11
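The same recoding can also be written with cut() for the age split, which is one possible alternative rather than the approach used above. The names y, had_induced, age_by_parity and age_by_induced are made up for this sketch. droplevels() is added so that the empty "0" column no longer shows up once the rows with no induced abortion are filtered out.

```r
# Recode the three columns as factors; cut() with right = FALSE puts
# age 35 itself into the ">=35" group, matching the age < 35 test above.
y <- data.frame(
  age     = cut(infert$age, breaks = c(-Inf, 35, Inf), right = FALSE,
                labels = c("<35", ">=35")),
  parity  = factor(infert$parity == 1, levels = c(TRUE, FALSE),
                   labels = c("1", ">1")),
  induced = factor(infert$induced, levels = 0:2,
                   labels = c("0", "1", "more"))
)

# Keep only women with at least one induced abortion.
had_induced <- y$induced != "0"

# Age by parity, and age by induced with the unused "0" level dropped.
age_by_parity  <- table(y[had_induced, c("age", "parity")])
age_by_induced <- table(droplevels(y[had_induced, c("age", "induced")]))

age_by_parity
age_by_induced
```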
And to stratify at age 35 and look at parity against the number of induced abortions, would it be something like this?
> table(subset(x,subset=age=="<35",select=c(parity, induced)))
      induced
parity  0  1 more
    1  52 17    0
    >1 40 33   26
> table(subset(x,subset=age!="<35",select=c(parity, induced)))
      induced
parity  0  1 more
    1  24  6    0
    >1 27 12   11
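Instead of calling table() once per age group, both strata can be built in a single step as a three-way table. The following is a minimal sketch using xtabs() and ftable(); the name strata is illustrative, and x is rebuilt here only so that the snippet runs on its own.

```r
# Rebuild the recoded data frame so this snippet is self-contained.
x <- with(infert, data.frame(
  age     = factor(age < 35, levels = c(TRUE, FALSE),
                   labels = c("<35", ">=35")),
  parity  = factor(parity == 1, levels = c(TRUE, FALSE),
                   labels = c("1", ">1")),
  induced = factor(induced, levels = 0:2,
                   labels = c("0", "1", "more"))
))

# Three-way table: parity x induced, one slice per age group.
strata <- xtabs(~ parity + induced + age, data = x)

# Flatten it for display, with age and parity as row variables.
ftable(strata, row.vars = c("age", "parity"))
```

A single slice is available by indexing, for example strata[, , "<35"] for the under-35 group.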
From this we can see that, in this dataset, no woman with a parity of 1 had two or more induced abortions.
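That observation can also be checked directly on the raw data without any recoding. A quick sketch, where first_birth is just an illustrative name:

```r
# Women with a parity of exactly 1.
first_birth <- subset(infert, parity == 1)

# Distribution of induced abortion counts within that group.
table(first_birth$induced)

# TRUE would mean at least one such woman has two or more induced abortions.
any(first_birth$induced >= 2)
```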