r - Creating a large covariance matrix


I need to create ~110 covariance matrices of doubles, each of size 19347 x 19347, and add them all together.

This in itself isn't difficult, and for smaller matrices the following code works fine.

covmat <- matrix(0, ncol = 19347, nrow = 19347)
files <- list.files("path/to/folder/")
for (name in files) {
  # one value per line, 19347 values per file
  text <- readLines(paste("path/to/folder/", name, sep = ""), n = 19347, encoding = "UTF-8")
  for (i in 1:19347) {
    for (k in 1:19347) {
      # accumulate this file's outer product element by element
      covmat[i, k] <- covmat[i, k] + (as.numeric(text[i]) * as.numeric(text[k]))
    }
  }
}

To save memory I don't calculate each individual matrix and then add them at the end; instead the loop runs through each file and accumulates its contribution into covmat.

The problem is that when I run it on the real data I need to use, it takes far too long. There isn't that much data, but I think it is a CPU- and memory-intensive job. I've been running it for ~10 hours and it still doesn't compute a result.

I have looked at trying to use MapReduce (AWS EMR), but I've come to the conclusion that I don't believe this is a MapReduce problem, since it isn't a big data problem. Here is the code for the mapper and reducer I have been playing with, though, in case I have been doing something wrong.

# mapper
text <- readLines("stdin", n = 4, encoding = "UTF-8")
covmat <- matrix(0, ncol = 5, nrow = 5)

for (i in 1:5) {
  for (k in 1:5) {
    covmat[i, k] <- (as.numeric(text[i]) * as.numeric(text[k]))
  }
}

# emit the matrix flattened in column-major order
cat(covmat)

# reducer
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

final <- matrix(0, ncol = 19347, nrow = 19347)
con <- file("stdin", open = "r")
# read each mapper's output one line at a time and add it in
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
  line <- trimWhiteSpace(line)
  words <- splitIntoWords(line)
  final <- final + matrix(as.numeric(words), ncol = 19347, nrow = 19347)
}
close(con)
cat(final)
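(As a sanity check on the mapper/reducer handshake: cat() writes a matrix flattened in column-major order, and the reducer's matrix(as.numeric(words), ...) rebuilds it on the same assumption. A tiny round trip with made-up 2 x 2 data, just to illustrate the convention:)

m <- matrix(c(1, 2, 3, 4), ncol = 2, nrow = 2)
flat <- paste(m, collapse = " ")    # roughly what cat(m) emits on stdout
words <- unlist(strsplit(flat, "[[:space:]]+"))
rebuilt <- matrix(as.numeric(words), ncol = 2, nrow = 2)
identical(m, rebuilt)               # TRUE: same column-major layout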

Can anyone suggest how to solve this problem?

Thanks in advance.

EDIT

Thanks to the great advice of the commenters below, I have revised the code to be much more efficient.

files <- list.files("path/to/file")
covmat <- matrix(0, ncol = 19347, nrow = 19347)
for (name in files) {
  invec <- scan(paste("path/to/file", name, sep = ""))
  covmat <- covmat + outer(invec, invec, "*")
}

Here is an example of the kind of file I am trying to process.

1       0.00114582882882883
2      -0.00792611711711709
...     ...
19346  -0.00089507207207207
19347  -0.00704709909909909
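(One thing worth noting: with a two-column file like this, scan() with its default settings reads both columns into a single numeric vector, row indices included. If only the second column is wanted, a sketch using read.table, assuming whitespace-separated columns:)

# read both columns, keep only the values in column 2
dat <- read.table(paste("path/to/file", name, sep = ""))
invec <- dat[[2]]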

On running the program it still takes ~10 minutes per file. Does anyone have any advice on how this can be sped up?

I have 8GB of RAM, and when the program runs R uses 4.5GB of it, leaving only a small amount free (a single 19347 x 19347 matrix of doubles is already about 19347^2 * 8 bytes, roughly 3GB).

I am running Mac OS X Snow Leopard and 64-bit R v2.15.
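For what it's worth, a variant I have not benchmarked: tcrossprod(invec) computes the same invec %*% t(invec) product as outer(invec, invec, "*") but goes through BLAS, so it may be faster here:

files <- list.files("path/to/file")
covmat <- matrix(0, ncol = 19347, nrow = 19347)
for (name in files) {
  invec <- scan(paste("path/to/file", name, sep = ""))
  # tcrossprod(x) is x %*% t(x); same values as outer(invec, invec, "*")
  covmat <- covmat + tcrossprod(invec)
}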

I had some concerns about the logic in the loop, namely whether it really was calculating the same result as covmat + outer(invec, invec). A quick check:

text <- c("1", "5", "8")
covmat <- matrix(0, ncol = 3, nrow = 3)
for (i in 1:3) {
  for (k in 1:3) {
    covmat[i, k] <- (as.numeric(text[i]) * as.numeric(text[k]))
  }
}

covmat
     [,1] [,2] [,3]
[1,]    1    5    8
[2,]    5   25   40
[3,]    8   40   64

outer(as.numeric(text), as.numeric(text), "*")
     [,1] [,2] [,3]
[1,]    1    5    8
[2,]    5   25   40
[3,]    8   40   64

That doesn't make it wrong, but it can be simplified in R, and if you want, a vectorized function can replace the entire inner two loops:

invec <- scan(paste("path/to/folder/", name, sep = ""))
covmat <- outer(invec, invec, "*")

You are, however, overwriting each of the results from successive files with the outermost loop, which is not what you said you wanted to do, so you may need to decide on a data structure to store the matrices in, the natural choice being a list:

matlist <- list()
files <- list.files("path/to/folder/")
for (name in files) {
  invec <- scan(paste("path/to/folder/", name, sep = ""))
  covmat <- outer(invec, invec, "*")
  matlist[[name]] <- covmat
}

Now 'matlist' should hold as many matrices as there are files in the directory. You can access them by name or by order of entry, and you can retrieve the names with:

names(matlist) 
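And since the stated goal was to add all of the matrices together, the stored list can be folded into a single sum with Reduce (a minimal sketch, assuming 'matlist' was built as above):

# element-wise sum of every matrix in the list
total <- Reduce(`+`, matlist)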
