Using dynamic column names in `data.table`
My question is similar to two other questions (one, two), but I couldn't apply their solutions to my problem.
I have a data.table and want to calculate the mean values of several columns, grouped by the value of another column.
Here is an example:
library(data.table)
txt <- "condition,var1,var2,var3
one,100,1000,10000
one,101,1001,10001
one,102,1002,10002
two,103,1003,10003
two,104,1004,10004
two,105,1005,10005
three,106,1006,10006
three,107,1007,10007
three,108,1008,10008
four,109,1009,10009
four,110,1010,10010"
dtb <- data.table(read.delim(textConnection(txt), header=TRUE, sep=','))
print(dtb)
Here is the output:
    status var1 var2  var3
 1:      1  100 1000 10000
 2:      1  101 1001 10001
 3:      1  102 1002 10002
 4:      2  103 1003 10003
 5:      2  104 1004 10004
 6:      2  105 1005 10005
 7:      3  106 1006 10006
 8:      3  107 1007 10007
 9:      3  108 1008 10008
10:      4  109 1009 10009
11:      4  110 1010 10010
I want the mean values of var1, var2 and var3. Each one of them is pretty easy to obtain on its own, e.g.

dtb[, mean(var1), by=condition]

but I want to end up with this:
   status  var1   var2    var3
1:      1 101.0 1001.0 10001.0
2:      2 104.0 1004.0 10004.0
3:      3 107.0 1007.0 10007.0
4:      4 109.5 1009.5 10009.5
And of course, I need the column names to be dynamic, i.e. something like

dtb[, list(mean(var1), mean(var2), mean(var3)), by=condition]

is undesirable.
Thank you.
You should use .SDcols, especially if you have many columns and require a particular operation to be performed only on a subset of the columns (apart from the grouping variable columns).
dtb[, lapply(.SD, mean), by=condition, .SDcols=2:4]
#    status  var1   var2    var3
# 1:      1 101.0 1001.0 10001.0
# 2:      2 104.0 1004.0 10004.0
# 3:      3 107.0 1007.0 10007.0
# 4:      4 109.5 1009.5 10009.5
If you want the column names to be dynamic, you can put the names of the columns you'd like to take the mean of in a variable first and then pass that variable to .SDcols, like this:
keys <- setdiff(names(dtb), "condition")   # keys = var1, var2, var3
dtb[, lapply(.SD, mean), by=condition, .SDcols=keys]
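If the grouping column also needs to be dynamic, a character vector works for `by` as well. A minimal sketch, assuming the same dtb as above (the names grp_col and val_cols are just illustrative, not from the original post):

# make the grouping column dynamic too
grp_col  <- "condition"
val_cols <- setdiff(names(dtb), grp_col)
dtb[, lapply(.SD, mean), by = c(grp_col), .SDcols = val_cols]   # c() forces evaluation to a character vector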
Edit: As Matthew Dowle rightly pointed out, since you require the mean computed on every other column after grouping by condition, you could simply do:
dtb[, lapply(.SD, mean), by=condition]
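If you also want the result columns to get new names (say a "mean_" prefix, which is purely an illustrative choice and not part of the original question), one option is to rename them afterwards by reference with setnames():

res  <- dtb[, lapply(.SD, mean), by = condition]
cols <- setdiff(names(res), "condition")
setnames(res, cols, paste0("mean_", cols))   # mean_var1, mean_var2, mean_var3
res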
David's edit (which got rejected): Read more about .SD in this post; I find it relevant here. Thanks @David.
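To get a feel for what .SD contains, a quick illustrative trick is to print it for each group; every printed chunk is that group's subset of the data, minus the grouping column:

dtb[, print(.SD), by = condition]   # prints one sub-data.table per group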
Edit 2: Suppose you have a data.table with 1000 rows and 301 columns (one grouping column and 300 numeric columns):
require(data.table)
set.seed(45)
dt <- data.table(grp = sample(letters[1:15], 1000, replace=TRUE))
m  <- matrix(rnorm(300*1000), ncol=300)
dt <- cbind(dt, m)
setkey(dt, "grp")
and you want to find the mean of, say, columns 251:300 alone. What could you do?

You could compute the mean of all the columns and then subset the result to just these columns (which is not efficient, as you'd compute over the whole data):
dt.out <- dt[, lapply(.SD, mean), by=grp]
dim(dt.out)   # 15 * 301, not efficient
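The subsetting step would then look something like this (assuming the 300 numeric columns got the default names V1..V300 from cbind):

# keep only the grouping column and the 50 columns of interest from the full result
dt.out[, c("grp", paste0("V", 251:300)), with = FALSE]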
Or you could filter the data.table down to these columns first and then compute the mean (which again is not the best solution, as you'd have to create a subsetted data.table every time you want to operate on some columns):
dt.sub <- dt[, c(1, 251:300), with=FALSE]
setkey(dt.sub, "grp")
dt.out <- dt.sub[, lapply(.SD, mean), by=grp]
Or you could specify each of the columns one by one, as you'd normally do (but this is only desirable for a small number of columns):
# if you need only one or a few columns
dt.out <- dt[, list(m.V251 = mean(V251)), by = grp]
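Spelling out even a handful of the fifty columns already shows why this doesn't scale (the m.* names are just illustrative):

dt.out <- dt[, list(m.V251 = mean(V251), m.V252 = mean(V252), m.V253 = mean(V253)), by = grp]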
So what's the best solution? The answer is .SDcols.
As the documentation states, for a data.table x, .SDcols specifies the columns that are included in .SD. This implicitly filters the columns passed to .SD instead of creating a subset (as we did before), so it's efficient and fast!
How can we do this? By specifying either the column numbers:
dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = 251:300]
dim(dt.out)   # 15 * 51 (what we expect)
Or alternatively by specifying the column names:
ids <- paste0("V", 251:300)   # column names
dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = ids]
dim(dt.out)   # 15 * 51 (what we expect)
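As a side note, more recent versions of data.table (this is version-dependent, so check your installation) also let .SDcols take an unquoted column range, which avoids building the names by hand:

# column range by name; requires a reasonably recent data.table
dt.out <- dt[, lapply(.SD, mean), by = grp, .SDcols = V251:V300]
dim(dt.out)   # 15 * 51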
It accepts both column names and column numbers as arguments. In both cases, .SD is provided with just the columns we've specified.
Hope this helps.
r data.table