Daily Musings

Mutterings of a man who has turned forty, the age of "no more doubts", and still cannot hide his bewilderment

A few notes on read.table

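It has been a while, so I thought I would write about R again.
Someone recently asked me a question about read.table, and it struck me as the kind of thing people stumble over, which is what prompted this post.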

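read.table is the function underlying the read.XXX family (read.delim, read.csv, read.csv2), and its advantage is that it also copes flexibly with formats that are neither tab-delimited nor CSV.
Running ?read.table shows a large number of arguments, enough to handle most delimited text formats.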

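The question I received involved what looked like some kind of log file, with these characteristics:

  • the separator is "|"
  • a number, a text value, and a date sit side by side on each line
  • there is no header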

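Personally, I think log files are best reshaped with Perl or Ruby first and then processed in R.

Still, some people presumably want to finish the whole job in R.
In that case, the points to watch are:

  • read.table reads character strings as factors by default (in R versions before 4.0.0; from 4.0.0 on, stringsAsFactors defaults to FALSE and they stay character)
  • whatever separator a date uses internally, a date column read by read.table will most likely end up as a factor (or a character string), not as a Date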
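
The two caveats above can be checked with a small sketch. The temporary file and the variable names (tmp, as_factor, as_char) are made up for illustration; stringsAsFactors=TRUE is passed explicitly to reproduce the old default.

```r
# Write a tiny pipe-separated sample to a temporary file.
tmp <- tempfile(fileext = ".dat")
writeLines(c("1|a|2012-06-20", "2|b|2012-06-21"), tmp)

# stringsAsFactors = TRUE reproduces the default of R versions before 4.0.0.
as_factor <- read.table(tmp, sep = "|", header = FALSE, stringsAsFactors = TRUE)
# stringsAsFactors = FALSE keeps the text and date columns as character.
as_char <- read.table(tmp, sep = "|", header = FALSE, stringsAsFactors = FALSE)

sapply(as_factor, class)  # V2 and V3 come back as factor
sapply(as_char, class)    # V2 and V3 come back as character

unlink(tmp)
```

Either way the date column is not a Date yet; it still has to be converted.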

So I saved the following as "test1.dat":

1|a|2012-06-20
2|b|2012-06-21
3|c|2012-06-22
4|d|2012-06-23

The 2012-06-20 style (%Y-%m-%d, the ISO 8601 form) is the date notation that R recognizes by default.
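
For anyone who wants to follow along without a text editor, the same file can be produced from R itself. This is only a convenience sketch; writing into tempdir() and the variable name path are my assumptions, not part of the original setup.

```r
# Build the four sample lines: integer id, a letter, and consecutive ISO dates.
dates <- format(as.Date("2012-06-20") + 0:3)   # "2012-06-20" ... "2012-06-23"
lines <- sprintf("%d|%s|%s", 1:4, letters[1:4], dates)

# Write them to test1.dat inside the session's temporary directory.
path <- file.path(tempdir(), "test1.dat")
writeLines(lines, path)

readLines(path)
```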

The separator is set with the sep argument, the header with header, and the class of each column with colClasses, so:

> dat1 <- read.table("test1.dat", sep="|", header=FALSE, colClasses=c("integer","character","Date"))
> str(dat1)
'data.frame':	4 obs. of  3 variables:
 $ V1: int  1 2 3 4
 $ V2: chr  "a" "b" "c" "d"
 $ V3: Date, format: "2012-06-20" "2012-06-21" ...

Next, I saved the following as "test2.dat":

1|a|2012-06-20|2012.6.20|20/06/2012|06/20/2012|20Jun2012
2|b|2012-06-21|2012.6.21|21/06/2012|06/21/2012|21Jun2012
3|c|2012-06-22|2012.6.22|22/06/2012|06/22/2012|22Jun2012
4|d|2012-06-23|2012.6.23|23/06/2012|06/23/2012|23Jun2012

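That is, 2012.6.20 is the Japanese-style notation, 20/06/2012 is the European style (day/month/year), 06/20/2012 is the US style (month/day/year), and 20Jun2012 uses an English abbreviated month name.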

Using colClasses in the same way gives:

> dat2 <- read.table("test2.dat", sep="|", header=FALSE, colClasses=c("integer","character","Date","Date","Date","Date","Date"))
Error in charToDate(x) : 
  character string is not in a standard unambiguous format

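In other words, colClasses="Date" accepts nothing but the date notation that R recognizes by default.
More precisely, the strings do not match the formats tried by charToDate, the function that the colClasses argument of read.table ends up running internally (via as.Date).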
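
The same failure can be reproduced without read.table by calling as.Date directly. The sketch below catches the error instead of stopping; the variable names res and ok are only for illustration.

```r
# Without a format, as.Date only tries its default formats,
# so a dotted date raises an error; tryCatch returns the condition object.
res <- tryCatch(as.Date("2012.6.20"), error = function(e) e)
inherits(res, "error")   # TRUE

# With an explicit format the very same string parses fine.
ok <- as.Date("2012.6.20", format = "%Y.%m.%d")
ok
```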


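Since the internal processing is charToDate, reading the dates as character strings should get around the problem.
Once the data has been read in, it can be fixed up afterwards.
This kind of flexibility really is one of R's strengths.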
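
As an aside, there is another route that keeps the conversion inside read.table: for class names it does not know, colClasses falls back on methods::as, so a conversion registered with setAs gets picked up. The class name "dotDate" below is a hypothetical name invented for this sketch, and the sample file is a cut-down stand-in for test2.dat.

```r
library(methods)

# Hypothetical virtual class whose only purpose is to carry a date format.
setClass("dotDate")
setAs("character", "dotDate",
      function(from) as.Date(from, format = "%Y.%m.%d"))

# Two-column sample: an id and a dotted date.
tmp <- tempfile(fileext = ".dat")
writeLines(c("1|2012.6.20", "2|2012.6.21"), tmp)

# read.table hands the second column to the coercion registered above.
dat <- read.table(tmp, sep = "|", header = FALSE,
                  colClasses = c("integer", "dotDate"))
str(dat)

unlink(tmp)
```

One such class is needed per date format, so for a one-off file reading as character and converting afterwards is probably less ceremony.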

The argument that makes every column containing strings a character column instead of a factor is stringsAsFactors, so:

> dat2 <- read.table("test2.dat", sep="|", header=FALSE, stringsAsFactors=FALSE)
> str(dat2)
'data.frame':	4 obs. of  7 variables:
 $ V1: int  1 2 3 4
 $ V2: chr  "a" "b" "c" "d"
 $ V3: chr  "2012-06-20" "2012-06-21" "2012-06-22" "2012-06-23"
 $ V4: chr  "2012.6.20" "2012.6.21" "2012.6.22" "2012.6.23"
 $ V5: chr  "20/06/2012" "21/06/2012" "22/06/2012" "23/06/2012"
 $ V6: chr  "06/20/2012" "06/21/2012" "06/22/2012" "06/23/2012"
 $ V7: chr  "20Jun2012" "21Jun2012" "22Jun2012" "23Jun2012"

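All the dates are now character strings, so all that remains is to turn the strings into dates.

Next comes the conversion to Date. The function for this is as.Date.
According to ?as.Date, it takes a character vector as the argument x, and the format is given with the argument format.
In brief, the rules are:

  • year is %Y (4 digits) or %y (2 digits)
  • month is %m (numeric), or %B / %b (full / abbreviated month name in the current locale); note that %M means minutes, not month
  • day is %d
  • the separators between year, month, and day are written as literal characters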
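
A quick way to get a feel for these specifiers is to format a known date with each of them. This is a minimal sketch; the variable names are arbitrary.

```r
d <- as.Date("2012-06-20")

y4 <- format(d, "%Y")   # 4-digit year:  "2012"
y2 <- format(d, "%y")   # 2-digit year:  "12"
mm <- format(d, "%m")   # numeric month: "06"
dd <- format(d, "%d")   # day of month:  "20"

# Literal separators in the format must match the input exactly.
short <- as.Date("12.6.20", format = "%y.%m.%d")   # 2012-06-20

# %M belongs to times: it is the minute, not the month.
mins <- format(as.POSIXct("2012-06-20 09:38:35", tz = "UTC"), "%M")   # "38"
```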

So for date notations made up only of digits, the following is enough.

> # This is R's default format, so nothing needs to be specified
> as.Date(dat2[,3])
[1] "2012-06-20" "2012-06-21" "2012-06-22" "2012-06-23"
> 
> # The call below uses the 4-digit year, the commented one the 2-digit year
> # JP type
> as.Date(dat2[,4], format="%Y.%m.%d")
[1] "2012-06-20" "2012-06-21" "2012-06-22" "2012-06-23"
> # for "12.6.20", use as.Date(dat2[,4], format="%y.%m.%d")
> 
> # Any combination of numbers and separators works, European or US
> # European type (day/month/year)
> as.Date(dat2[,5], format="%d/%m/%Y")
[1] "2012-06-20" "2012-06-21" "2012-06-22" "2012-06-23"
> 
> # US type (month/day/year)
> as.Date(dat2[,6], format="%m/%d/%Y")
[1] "2012-06-20" "2012-06-21" "2012-06-22" "2012-06-23"
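
When several columns each need their own format, the conversions can be applied in one pass by pairing column names with format strings. This is only a sketch: the data frame dat stands in for the character columns of dat2, and the names in fmts are my own choice.

```r
# Stand-in for the character date columns read from test2.dat.
dat <- data.frame(
  V3 = c("2012-06-20", "2012-06-21"),
  V4 = c("2012.6.20", "2012.6.21"),
  V5 = c("20/06/2012", "21/06/2012"),
  V6 = c("06/20/2012", "06/21/2012"),
  stringsAsFactors = FALSE
)

# One format string per column, keyed by column name.
fmts <- c(V3 = "%Y-%m-%d", V4 = "%Y.%m.%d", V5 = "%d/%m/%Y", V6 = "%m/%d/%Y")

# Map walks over the columns and the formats in parallel.
dat[names(fmts)] <- Map(as.Date, dat[names(fmts)], format = fmts)
str(dat)
```

All four columns end up holding the same Date values, which makes a mistyped format easy to spot.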

Next is the English abbreviated month name.
%B and %b are available, but they depend on the current locale.

> Sys.time() # this output...
[1] "2012-06-23 09:38:35 JST"
> format(Sys.time(), "%Y %b %d %H:%M:%S") # ...shows 6月 instead of Jun in a Japanese locale
[1] "2012  6月 23 09:38:35"

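There is probably a suitable function in some package, but digging through the huge number of packages on CRAN has become a chore lately, so I will roll my own.

The abbreviated month names are held in month.abb.
(Incidentally, month.name holds the full spellings.)

> month.abb
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
> month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"     
 [7] "July"      "August"    "September" "October"   "November"  "December" 
> # strip the digits so that only the month abbreviation is left
> x <- gsub("[[:digit:]]","",dat2[,7])
> # replace each abbreviation with "-<month number>-", looked up in month.abb
> x <- sapply(seq_along(x), function(i){
+  sub(x[i], paste("-", match(x[i], month.abb), "-", sep=""), dat2[i,7])
+ })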
> x
[1] "20-6-2012" "21-6-2012" "22-6-2012" "23-6-2012"
> as.Date(x, format="%d-%m-%Y")
[1] "2012-06-20" "2012-06-21" "2012-06-22" "2012-06-23"
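
Another option, sketched here under the assumption that changing the locale for a moment is acceptable, is to switch LC_TIME to "C" while parsing so that %b matches the English abbreviations. The helper name parse_english_dates is hypothetical.

```r
# Parse dates with English month names regardless of the session locale.
parse_english_dates <- function(x, format) {
  old <- Sys.getlocale("LC_TIME")
  # Restore the previous locale even if parsing fails.
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", "C")
  as.Date(x, format = format)
}

parse_english_dates(c("20Jun2012", "21Jun2012"), "%d%b%Y")
```

This avoids the string surgery, at the cost of touching a global setting for the duration of the call.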

Incidentally, the help page for strftime explains %Y, %y, and the other specifiers in detail.

The details of the formats are system-specific, but the following
are defined by the ISO C99 / POSIX standard for ‘strftime’ and
are likely to be widely available. A _conversion specification_
is introduced by ‘%’, usually followed by a single letter or
‘O’ or ‘E’ and then a single letter. Any character in the
format string not part of a conversion specification is
interpreted literally (and ‘%%’ gives ‘%’). Widely
implemented conversion specifications include

‘%a’ Abbreviated weekday name in the current locale. (Also
matches full name on input.)

‘%A’ Full weekday name in the current locale. (Also matches
abbreviated name on input.)

‘%b’ Abbreviated month name in the current locale. (Also
matches full name on input.)

‘%B’ Full month name in the current locale. (Also matches
abbreviated name on input.)

‘%c’ Date and time. Locale-specific on output, ‘"%a %b %e
%H:%M:%S %Y"’ on input.

‘%d’ Day of the month as decimal number (01-31).

‘%H’ Hours as decimal number (00-23). As a special exception
times such as ‘24:00:00’ are accepted for input, since ISO
8601 allows these.

‘%I’ Hours as decimal number (01-12).

‘%j’ Day of year as decimal number (001-366).

‘%m’ Month as decimal number (01-12).

‘%M’ Minute as decimal number (00-59).

‘%p’ AM/PM indicator in the locale. Used in conjunction with
‘%I’ and *not* with ‘%H’. An empty string in some
locales.

‘%S’ Second as decimal number (00-61), allowing for up to two
leap-seconds (but POSIX-compliant implementations will ignore
leap seconds).

‘%U’ Week of the year as decimal number (00-53) using Sunday as
the first day 1 of the week (and typically with the first
Sunday of the year as day 1 of week 1). The US convention.

‘%w’ Weekday as decimal number (0-6, Sunday is 0).

‘%W’ Week of the year as decimal number (00-53) using Monday as
the first day of week (and typically with the first Monday of
the year as day 1 of week 1). The UK convention.

‘%x’ Date. Locale-specific on output, ‘"%y/%m/%d"’ on input.

‘%X’ Time. Locale-specific on output, ‘"%H:%M:%S"’ on input.

‘%y’ Year without century (00-99). On input, values 00 to 68
are prefixed by 20 and 69 to 99 by 19 - that is the behaviour
specified by the 2004 and 2008 POSIX standards, but they do
also say ‘it is expected that in a future version the
default century inferred from a 2-digit year will change’.

‘%Y’ Year with century. Note that whereas there was no zero in
the original Gregorian calendar, ISO 8601:2004 defines it to
be valid (interpreted as 1BC): see . Note that the
standard also says that years before 1582 in its calendar
should only be used with agreement of the parties involved.

‘%z’ Signed offset in hours and minutes from UTC, so ‘-0800’
is 8 hours behind UTC.

‘%Z’ (output only.) Time zone as a character string (empty if
not available).

Where leading zeros are shown they will be used on output but are
optional on input.

Note that when ‘%z’ or ‘%Z’ is used for output with an object
with an assigned timezone an attempt is made to use the values for
that timezone - but it is not guaranteed to succeed.

Also defined in the current standards but less widely implemented
(e.g. not for output on Windows) are

‘%C’ Century (00-99): the integer part of the year divided by
100.

‘%D’ Date format such as ‘%m/%d/%y’: ISO C99 says it should be
that exact format.

‘%e’ Day of the month as decimal number (1-31), with a leading
space for a single-digit number.

‘%F’ Equivalent to %Y-%m-%d (the ISO 8601 date format).

‘%g’ The last two digits of the week-based year (see ‘%V’).
(Accepted but ignored on input.)

‘%G’ The week-based year (see ‘%V’) as a decimal number.
(Accepted but ignored on input.)

‘%h’ Equivalent to ‘%b’.

‘%k’ The 24-hour clock time with single digits preceded by a
blank.

‘%l’ The 12-hour clock time with single digits preceded by a
blank.

‘%n’ Newline on output, arbitrary whitespace on input.

‘%r’ The 12-hour clock time (using the locale's AM or PM).

‘%R’ Equivalent to ‘%H:%M’.

‘%t’ Tab on output, arbitrary whitespace on input.

‘%T’ Equivalent to ‘%H:%M:%S’.

‘%u’ Weekday as a decimal number (1-7, Monday is 1).

‘%V’ Week of the year as decimal number (00-53) as defined in
ISO 8601. If the week (starting on Monday) containing 1
January has four or more days in the new year, then it is
considered week 1. Otherwise, it is the last week of the
previous year, and the next week is week 1. (Accepted but
ignored on input.)
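
A few of the specifiers from the help text above can be tried out directly. This is a small sketch that only uses ones from the "widely implemented" list; the variable names are arbitrary.

```r
d <- as.Date("2012-06-20")

doy  <- format(d, "%j")   # day of the year: "172" (2012 is a leap year)
wday <- format(d, "%w")   # weekday with Sunday as 0: "3", a Wednesday

# The %y century rule on input: 00-68 become 20xx, 69-99 become 19xx.
y68 <- as.Date("68-01-01", format = "%y-%m-%d")   # 2068-01-01
y69 <- as.Date("69-01-01", format = "%y-%m-%d")   # 1969-01-01
```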