getting it wrong with R

by Michael Werneburg
on 2017.07.23

I'm taking a data science "MOOC" on Coursera. There's an R programming element to it, and I'm currently taking that class, the second one.

Today I spent a few hours on a twenty-minute assignment because I misread it. But if anyone's interested in a way to fairly quickly read a raft of similarly formatted CSV files into one matrix, here's one way of doing so.


library(plyr)  # provides rbind.fill.matrix

corr <- function(directory, threshold = 0) {
  # 'directory' is a name of a valid subdirectory
  # 'threshold' is an optional cut-off for retention
  # of the records in any file

  # step zero, set up a matrix with the two critical
  # fields from the files
  dat <- matrix(data = NA, nrow = 0, ncol = 2, byrow = TRUE)
  colnames(dat) <- c("sulfate", "nitrate")

  files <- list.files(directory, all.files = TRUE, full.names = TRUE, recursive = TRUE)
  for (filename in files) {
    # skip anything that is not a CSV file
    if (!grepl("\\.csv$", filename)) {
      next
    }

    # e.g. poldata <- read.csv(file = "specdata/002.csv", header = TRUE, sep = ",")
    poldata <- read.csv(file = filename, header = TRUE, sep = ",")

    # removes any incomplete records
    poldata <- poldata[complete.cases(poldata), ]

    # get a count of good records in the file
    rowsGood <- nrow(poldata)

    if (rowsGood >= threshold) {
      # this was by far the fastest route I could find

      # 1. cast the just-loaded data.frame as a matrix
      mat <- as.matrix(poldata[c("sulfate", "nitrate")])

      # 2. bulk-copy the records (using plyr library)
      dat <- rbind.fill.matrix(dat, mat)
    }
  }

  # correlation matrix of the two pooled columns
  cor(data.frame(dat[, 1], dat[, 2]))
}
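
For comparison, here is a minimal sketch of the same read-and-stack idea using base R only. It collects one matrix per file in a list and binds them in a single call at the end, so it needs no plyr. This is not the code from the post, and it may not be faster than rbind.fill.matrix. The names read_all and demo_dir and the sample values are made up for illustration.

```r
# Hypothetical helper (not from the post): read every CSV under `directory`,
# drop incomplete rows, and stack the sulfate/nitrate columns into one matrix.
read_all <- function(directory, threshold = 0) {
  files <- list.files(directory, pattern = "\\.csv$",
                      full.names = TRUE, recursive = TRUE)

  pieces <- lapply(files, function(filename) {
    poldata <- read.csv(filename, header = TRUE)
    poldata <- poldata[complete.cases(poldata), ]
    if (nrow(poldata) < threshold) {
      return(NULL)  # too few good records; rbind() ignores NULL entries
    }
    as.matrix(poldata[c("sulfate", "nitrate")])
  })

  # bind all per-file matrices in one call instead of growing a matrix in a loop
  combined <- do.call(rbind, pieces)
  if (is.null(combined)) {
    # no file qualified: return an empty two-column matrix
    combined <- matrix(numeric(0), nrow = 0, ncol = 2,
                       dimnames = list(NULL, c("sulfate", "nitrate")))
  }
  combined
}

# Made-up sample data in a temporary directory, so the sketch runs on its own.
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(2, 3, 6)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(5, 7), nitrate = c(10, NA)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

all_rows <- read_all(demo_dir)
cor(all_rows[, "sulfate"], all_rows[, "nitrate"])
```

Passing pattern to list.files does the job of the grepl check in the loop, and returning NULL for a file that falls under the threshold lets rbind skip it without a special case.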

Again, this is not the assignment from the Coursera course; it's something more difficult. I misread the assignment in the middle of one of my damn headaches while working against a deadline. I probably would have been better served by resting for that time and then reading the assignment correctly.
