Using dynamic column names in `data.table`
My question is similar to two other questions (one, two), but I couldn't apply their solutions to my problem.
I have a data.table and want to calculate the mean values of several columns, grouped by the value of another column.
Here is an example:
library(data.table)
txt <- "condition,var1,var2,var3
one,100,1000,10000
one,101,1001,10001
one,102,1002,10002
two,103,1003,10003
two,104,1004,10004
two,105,1005,10005
three,106,1006,10006
three,107,1007,10007
three,108,1008,10008
four,109,1009,10009
four,110,1010,10010"
dtb <- data.table(read.delim(textConnection(txt), header=TRUE, sep=','))
print(dtb)
Here is the output:
    status var1 var2  var3
 1:      1  100 1000 10000
 2:      1  101 1001 10001
 3:      1  102 1002 10002
 4:      2  103 1003 10003
 5:      2  104 1004 10004
 6:      2  105 1005 10005
 7:      3  106 1006 10006
 8:      3  107 1007 10007
 9:      3  108 1008 10008
10:      4  109 1009 10009
11:      4  110 1010 10010
I want the mean values of var1, var2 and var3. Each one of them is pretty easy to obtain on its own, e.g.

dtb[, mean(var1), by=condition]

but I want to end up with this:
   status  var1   var2    var3
1:      1 101.0 1001.0 10001.0
2:      2 104.0 1004.0 10004.0
3:      3 107.0 1007.0 10007.0
4:      4 109.5 1009.5 10009.5
And of course, I need the column names to be dynamic, i.e. something like

dtb[, list(mean(var1), mean(var2), mean(var3)), by=condition]

is undesirable.
Thank you.
You should use .SDcols, especially if you have many columns and require a particular operation to be performed only on a subset of the columns (apart from the grouping variable columns).
dtb[, lapply(.SD, mean), by=condition, .SDcols=2:4]
#    status  var1   var2    var3
# 1:      1 101.0 1001.0 10001.0
# 2:      2 104.0 1004.0 10004.0
# 3:      3 107.0 1007.0 10007.0
# 4:      4 109.5 1009.5 10009.5
If you want the column names to be dynamic, you can put the names of the columns you'd like to take the mean of in a variable first and then pass that variable to .SDcols, like this:
keys <- setdiff(names(dtb), "condition")   # keys = var1, var2, var3
dtb[, lapply(.SD, mean), by=condition, .SDcols=keys]
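If the grouping column also needs to be dynamic, a character vector works for `by` as well. A minimal sketch, assuming the same dtb as above (the names grp_col and val_cols are just illustrative, not from the original post):

# make the grouping column dynamic too
grp_col  <- "condition"
val_cols <- setdiff(names(dtb), grp_col)
dtb[, lapply(.SD, mean), by = c(grp_col), .SDcols = val_cols]   # c() forces evaluation to a character vector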
Edit: As Matthew Dowle rightly pointed out, since you require the mean computed on every other column after grouping by condition, you could simply do:
dtb[, lapply(.SD, mean), by=condition]
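If you also want the result columns to get new names (say a "mean_" prefix, which is purely an illustrative choice and not part of the original question), one option is to rename them afterwards by reference with setnames():

res  <- dtb[, lapply(.SD, mean), by = condition]
cols <- setdiff(names(res), "condition")
setnames(res, cols, paste0("mean_", cols))   # mean_var1, mean_var2, mean_var3
res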
David's edit (which got rejected): Read more about .SD in this post; I find it relevant here. Thanks @David.
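To get a feel for what .SD contains, a quick illustrative trick is to print it for each group; every printed chunk is that group's subset of the data, minus the grouping column:

dtb[, print(.SD), by = condition]   # prints one sub-data.table per group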
Edit 2: Suppose you have a data.table with 1000 rows and 301 columns (one grouping column and 300 numeric columns):
require(data.table)
set.seed(45)
dt <- data.table(grp = sample(letters[1:15], 1000, replace=TRUE))
m  <- matrix(rnorm(300*1000), ncol=300)
dt <- cbind(dt, m)
setkey(dt, "grp")
and you want to find the mean of, say, columns 251:300 alone. What could you do?

You could compute the mean of all the columns and then subset the result to just these columns (which is not efficient, as you'd compute over the whole data):
dt.out <- dt[, lapply(.SD, mean), by=grp]
dim(dt.out)   # 15 * 301, not efficient
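The subsetting step would then look something like this (assuming the 300 numeric columns got the default names V1..V300 from cbind):

# keep only the grouping column and the 50 columns of interest from the full result
dt.out[, c("grp", paste0("V", 251:300)), with = FALSE]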
Or you could filter the data.table down to these columns first and then compute the mean (which again is not the best solution, as you'd have to create a subsetted data.table every time you want to operate on some columns):
dt.sub <- dt[, c(1, 251:300), with=FALSE]
setkey(dt.sub, "grp")
dt.out <- dt.sub[, lapply(.SD, mean), by=grp]
Or you could specify each of the columns one by one, as you'd normally do (but this is only desirable for a small number of columns):
# if you need only one or a few columns
dt.out <- dt[, list(m.V251 = mean(V251)), by = grp]
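Spelling out even a handful of the fifty columns already shows why this doesn't scale (the m.* names are just illustrative):

dt.out <- dt[, list(m.V251 = mean(V251), m.V252 = mean(V252), m.V253 = mean(V253)), by = grp]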
So what's the best solution? The answer is .SDcols.
As the documentation states, for a data.table x, .SDcols specifies the columns that are included in .SD. This implicitly filters the columns passed to .SD instead of creating a subset (as we did before), so it's efficient and fast!
How can we do this? By specifying either the column numbers:
dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = 251:300]
dim(dt.out)   # 15 * 51 (what we expect)
Or alternatively by specifying the column names:
ids <- paste0("V", 251:300)   # column names
dt.out <- dt[, lapply(.SD, mean), by=grp, .SDcols = ids]
dim(dt.out)   # 15 * 51 (what we expect)
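As a side note, more recent versions of data.table (this is version-dependent, so check your installation) also let .SDcols take an unquoted column range, which avoids building the names by hand:

# column range by name; requires a reasonably recent data.table
dt.out <- dt[, lapply(.SD, mean), by = grp, .SDcols = V251:V300]
dim(dt.out)   # 15 * 51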
It accepts both column names and column numbers as arguments. In both cases, .SD is provided with just the columns we've specified.
Hope this helps.
r data.table