R - Creating a large covariance matrix
I need to create ~110 covariance matrices of doubles, each of size 19347 x 19347, and add them together.
This in itself isn't difficult, and for smaller matrices the following code works fine.
covmat <- matrix(0, ncol=19347, nrow=19347)
files <- list.files("path/to/folder/")
for(name in files){
    text <- readLines(paste("path/to/folder/", name, sep=""), n=19347, encoding="UTF-8")
    for(i in 1:19347){
        for(k in 1:19347){
            covmat[i, k] <- covmat[i, k] + (as.numeric(text[i]) * as.numeric(text[k]))
        }
    }
}
To save memory I don't calculate each individual matrix and then add them; instead I loop through each file and accumulate into covmat.
The problem is that when I run it on the real data I need to use, it takes far too long. There isn't that much data, so I think it is a CPU- and memory-intensive job. It has been running for ~10 hours without computing a result.
I have looked into using MapReduce (AWS EMR), but I've come to the conclusion that this isn't a MapReduce problem, since it isn't a big-data problem. Here is the code for the mapper and reducer I have been playing with, in case I have been doing something wrong.
#mapper
text <- readLines("stdin", n=4, encoding="UTF-8")
covmat <- matrix(0, ncol=5, nrow=5)
for(i in 1:5){
    for(k in 1:5){
        covmat[i, k] <- (as.numeric(text[i]) * as.numeric(text[k]))
    }
}
cat(covmat)

#reducer
trimWhiteSpace <- function(line) gsub("(^ +)|( +$)", "", line)
splitIntoWords <- function(line) unlist(strsplit(line, "[[:space:]]+"))

final <- matrix(0, ncol=19347, nrow=19347)
## **** could use a single readLines or read in blocks
con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1, warn = FALSE)) > 0) {
    line <- trimWhiteSpace(line)
    words <- splitIntoWords(line)
    final <- final + matrix(as.numeric(words), ncol=19347, nrow=19347)
}
close(con)
cat(final)
Can anyone suggest how to solve this problem?
Thanks in advance.
EDIT
Thanks to the great help of the commenters below, I have revised the code so that it is much more efficient.
files <- list.files("path/to/file")
covmat <- matrix(0, ncol=19347, nrow=19347)
for(name in files){
    invec <- scan(paste("path/to/file", name, sep=""))
    covmat <- covmat + outer(invec, invec, "*")
}
Here is an example of one of the files I am trying to process.
1 0.00114582882882883
2 -0.00792611711711709
...
...
19346 -0.00089507207207207
19347 -0.00704709909909909
On running the program it still takes ~10 minutes per file. Does anyone have advice on how this can be sped up?
I have 8 GB of RAM, and when the program runs R uses about 4.5 GB of it, leaving only a small amount free.
I am running Mac OS X Snow Leopard and 64-bit R v2.15.
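One possible direction for speeding this up (a sketch only, assuming each file uses the two-column "index value" layout shown above and that the folder path is a placeholder): the sum of the per-file outer products v %o% v equals V %*% t(V), where the per-file vectors are the columns of V, so the whole accumulation can be done with a single tcrossprod() call instead of a file-by-file loop.

## Sketch: accumulate all the outer products in one matrix multiplication.
## Assumes each file has 19347 "index value" rows; the path is a placeholder.
files <- list.files("path/to/file", full.names = TRUE)
vecs <- vapply(files, function(f) {
    d <- matrix(scan(f, quiet = TRUE), ncol = 2, byrow = TRUE)  # column 1 = index, column 2 = value
    d[, 2]                                                      # keep only the values
}, numeric(19347))
covmat <- tcrossprod(vecs)  # 19347 x 19347, equals the sum of outer(v, v) over all files

The 19347 x 19347 result alone needs roughly 3 GB on 64-bit R, so memory use is about the same as the loop version; the saving comes from replacing R-level loops with one BLAS-backed matrix multiplication.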
I have some concerns about the logic in the loop. You are calculating, element by element, the same result as covmat + outer(in.vec, in.vec):
text <- c("1", "5", "8")
covmat <- matrix(0, ncol=3, nrow=3)
for(i in 1:3){
    for(k in 1:3){
        covmat[i, k] <- (as.numeric(text[i]) * as.numeric(text[k]))
    }
}
covmat
     [,1] [,2] [,3]
[1,]    1    5    8
[2,]    5   25   40
[3,]    8   40   64

outer(as.numeric(text), as.numeric(text), "*")
     [,1] [,2] [,3]
[1,]    1    5    8
[2,]    5   25   40
[3,]    8   40   64
That doesn't make it wrong, but it can be simplified in R, and if you want, a vectorized function can replace the entire inner two loops:
invec <- scan(paste("path/to/folder/", name, sep=""))
covmat <- outer(invec, invec, "*")
You are, however, overwriting the result from each successive file with the outermost loop, which is not what you said you wanted to do, so you may need to decide on a data structure to store the matrices in, the natural choice being a list:
matlist <- list()
files <- list.files("path/to/folder/")
for(name in files){
    invec <- scan(paste("path/to/folder/", name, sep=""))
    covmat <- outer(invec, invec, "*")
    matlist[[name]] <- covmat
}
Now 'matlist' should have as many matrices as there are files in the directory. You can access them by name or by order of entry. You can retrieve the names with:
names(matlist)
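For example (a small usage sketch; the file name below is hypothetical), individual matrices can be pulled out by position or by name, and since the stated goal was to add them all together, Reduce() can do that over the list:

first  <- matlist[[1]]               # by order of entry
byname <- matlist[["somefile.txt"]]  # by name (hypothetical file name)
total  <- Reduce(`+`, matlist)       # element-wise sum of all the matrices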