Processing Big Data Files With R

Using R to sequentially group and sample a big data file for processing

I often find myself leveraging R on many projects as it has proven itself reliable, robust, and fun. Unfortunately, one day I found myself having to process and analyze a crazy big ~30 GB delimited file. If you do not already know, R, in short, stores imported data sets in memory, meaning I would require ~30 GB of RAM and a lot of patience to read a ~30 GB file directly into R. How much memory versus swap would actually be required I do not truly know, as my machine turned into a paperweight when I tried (for the heck of it) to load the file. I like my projects to run quickly, and I would not give up on R. I have used fread (data.table package) in the past to quickly read in 2 GB files, but this file was too big for even fread.

 

The Master Plan

 

Since I cannot read the entire file in at once, I must read it in piece by piece or not use R – but then I would have to change the title of this blog entry.

Here is my plan:

1. Figure out the optimal number of records I can read in and run through my algorithms at one time.
2. Read in, at most, that number of records per iteration.
3. Click run, get coffee, and celebrate the results.

For my project I determined that 90,000 records per read was optimal. Let's say each record averages 100 bytes; then a buffer of ~9 MB would require ~3,333 iterations through a 30 GB file.
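As a quick back-of-the-envelope check of those numbers (the 100-byte record size and 30 GB file size are the rough figures from above, not measured values):

recordsPerRead <- 90000                         # records pulled in per iteration
avgRecordBytes <- 100                           # assumed average record size in bytes
fileBytes      <- 30e9                          # ~30 GB file

bufferBytes <- recordsPerRead * avgRecordBytes  # ~9 MB buffer
iterations  <- ceiling(fileBytes / bufferBytes) # roughly 3,333 passes through the file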

In case you are going to write a comment about how I should have developed an ETL process and forgone R: I chose R because I had to analyze data in the file that could not be loaded into a database or elsewhere, and produce a nice knitr PDF report as output. One performance plus – I did not have to read in the entire file, only a percentage of it.

Simple File Sampling

Being able to exclude some of the data from the file does not give me permission to read only from the beginning or the end, as this could bias the results; I need a good sample of what is in the file. Here is the fun part. Before looping through the file, I determine how many start-stop ranges exist in the file given my buffer, then designate a percentage of those ranges as ‘Keeps’ that will be run through additional processing. This sampling percentage is a variable that can change at any time – maybe I sample 5% or 80%, depending on whether I am testing the process with a test file or running full-on in production.
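As a toy illustration of the idea (the real implementation follows in the next sections), suppose the buffer works out to roughly 3,333 start-stop ranges and I only want to keep 5% of them:

numRanges <- 3333                                    # start-stop ranges implied by the buffer
samplePct <- 0.05                                    # could just as easily be 0.80
keepIdx   <- sample(numRanges, ceiling(samplePct * numRanges))
length(keepIdx)                                      # ~167 ranges flagged as 'Keeps'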

How it Works

As stated above, I use the file size and my desired numbrReadRecords (the number of records per read) to initialize my buffer.

R – Breaking Apart Big Data File

 

# Sample the first 100 records to estimate the average record size
rows <- scan(extractFile, what = character(), skip = 2, nlines = 100,
             fill = FALSE, comment.char = "", sep = "\n")

avgRecSize <- floor(object.size(rows) / 100)    # average in-memory bytes per record

fileSize <- file.info(crazyBigFile)$size        # total file size in bytes
buf.size <- min(as.double(avgRecSize * numbrReadRecords), fileSize)

 

I use scan to read in the first 100 records (the more records, the more accurate the average will be, but the slower the read will be). I acquire the individual records by separating at the newline (‘\n’). object.size (utils package) is a quick way to determine the allocated space of an object; here I want to determine how many bytes, on average, a record in my crazy big file occupies. file.info()$size gives me (relatively quickly) the size of the file in bytes. Finally, I calculate the buffer size by multiplying my average record size by the number of records I want in the buffer. If that product is greater than the file size, then my buffer size is simply the file size.

Creating Read Ranges

Originally I attempted to use fread to jump to each start-stop range, which worked great for the first few iterations; unfortunately, each additional iteration became exponentially slower until my machine would catch fire (not in a good way). I ended up using readChar to fill the buffer, which reads at a consistent pace and does not consume all memory and computing resources.
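The key difference is that readChar works against an already-open connection, so each call simply continues from the current file position instead of re-seeking from the top. A minimal sketch of that pattern (the file path and chunk size here are placeholders, not the real project values):

con <- file("someBigFile.txt", "r")                    # hypothetical file
chunkBytes <- 9e6                                      # ~9 MB per read
chunk1 <- readChar(con, chunkBytes, useBytes = TRUE)   # first ~9 MB
chunk2 <- readChar(con, chunkBytes, useBytes = TRUE)   # picks up where chunk1 ended
close(con)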

R – Adding Sequence Ranges to Big Data File
library(data.table)   # rbindlist, setorder, := assignment
library(dplyr)        # left_join

# How many buffer-sized sequences fit in the file (fileSizeOrig is the file size in bytes)
numOfSeq <- floor(fileSizeOrig / buf.size)

# Evenly spaced end position for each sequence
sequenceIndex <- as.double(seq(from = buf.size, to = fileSizeOrig, length.out = numOfSeq[1]))

# One start/stop byte range per sequence
dtSequenceRanges <- rbindlist(lapply(seq_along(sequenceIndex), function(x) {
  data.frame(start = sequenceIndex[x] - (buf.size - 1), stop = sequenceIndex[x])
}), fill = TRUE)

# Randomly choose which ranges will actually be processed
sequenceSampleCnt <- ceiling(samplePct * nrow(dtSequenceRanges))
dtSequenceRangesLimited <- dtSequenceRanges[sample(.N, sequenceSampleCnt)]
dtSequenceRangesLimited[, Keep := TRUE]

# Flag the sampled ranges in the full table and restore file order
dtSequenceRanges <- left_join(dtSequenceRanges, dtSequenceRangesLimited)
setDT(dtSequenceRanges)                       # the join can return a plain data.frame
dtSequenceRanges[is.na(Keep), Keep := FALSE]
setorder(dtSequenceRanges, start)

 

The numOfSeq and sequenceIndex values determine how many sequences there are, given the file size and buffer size. They are then used to create a data.table of start and stop ranges.

If I were to use the full data.table dtSequenceRanges as-is, I would end up processing all records.

R – Sampling Sequence Ranges of Data File

Instead, I only want to process a subset of the records in the crazy big file by applying a sampling percentage. A new data.table is created that contains the limited subset (sample(.N, sequenceSampleCnt)); sequenceSampleCnt is the number of sequences that should be included in the sample.

To flag the sequences I want to process, I use a Keep field in my data.table. Only the ranges that appear in the limited subset table have a Keep value of TRUE, and those are the only sequences that are processed.

Reading the File

Now I am ready to start iterating through the file, one buffer read at a time.

in.file <- file(crazyBigFile, "r")
buf <- ""
res <- list()                               # collects per-chunk results
maxFilePos <- max(dtSequenceRanges$stop)    # last byte position of interest
bufferRead <- 0                             # bytes read so far

for (j in 1:nrow(dtSequenceRanges)) {

  # Stop if rounding in the sequence math pushed us past the end of the file
  if (bufferRead > maxFilePos) {
    print("Break loop")
    break
  }

  # This iteration's buffer is the width of the current start-stop range
  buf.size <- dtSequenceRanges[j]$stop - dtSequenceRanges[j]$start

  n <- min(c(buf.size, fileSize))

  buf <- readChar(in.file, n, useBytes = TRUE)

  # readChar returns character(0) once the end of the file is reached
  if (length(buf) == 0) {
    break
  }

  bufferRead <- bufferRead + n

  # Only the sampled ('Keep') ranges get processed
  if (dtSequenceRanges[j]$Keep == TRUE) {

    records <- strsplit(buf, "\n", fixed = TRUE, useBytes = TRUE)[[1]]

    #Run Code To Process Records

  } # end if Keep

} # end for

close(in.file)

 

The code is doing a number of things. First it initializes the file connection and the maxFilePos and bufferRead variables. The for loop runs over all sequence ranges, even the ones we do not want to keep. I experimented with many different file-position lookup options, and all of them would reread the file for each start-stop position call, making them slower than just doing a straight read-through. I have not yet tried doParallel here.

The buffer size is actually the difference between the sequence range's start and stop, which eliminates having to manually account for the smaller final read.

Everything loops as expected and only breaks out early if, for some reason, a rounding or sequence calculation causes the buffers/sequences to be imperfect and exceed the file size in aggregate.

The line if(dtSequenceRanges[j]$Keep == TRUE) indicates that we have hit one of our sampling buffers and should process these records. Because these occur randomly throughout the file, we do not have to worry about any bias introduced by reading only the first or second half, a quarter, or any other contiguous slice of the file.
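What happens inside the Keep block is project specific; the #Run Code To Process Records placeholder stands in for my actual analysis. Purely as a hypothetical example of the shape such code might take (the pipe delimiter and the "third field is numeric" assumption are made up for illustration), a kept chunk could be summarized like this:

# Hypothetical per-chunk processing: split each record on an assumed
# '|' delimiter, treat the third field as a numeric measure, and stash
# a small summary in the res list created before the loop
fields  <- strsplit(records, "|", fixed = TRUE)
measure <- as.numeric(vapply(fields, `[`, character(1), 3))
res[[length(res) + 1]] <- c(records = length(records),
                            total   = sum(measure, na.rm = TRUE))

After the loop, the per-chunk summaries in res can be combined (for example with do.call(rbind, res)) and fed into the knitr report.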

There might be better ways to process big data files using R; however, I was unable to find one that suited my needs. I have been using this approach to process dozens of 1 GB – 60 GB files a day, even joining files on common keys and performing database queries.

I hope this provides some benefit to anyone facing the same issues I had.

 

Regards,

Jonathan

 
