
We were recently building a Shiny app that had to load data from a very large data frame. Loading it directly affected the app initialization time, so we looked into different ways of reading data from files into R (in this case, csv files provided by the customer) to identify the fastest one.

The goal of my post is to compare:

  1. read.csv from utils, which was the standard way of reading csv files into R in RStudio,
  2. read_csv from readr, which replaced the former as the standard way of doing it in RStudio,
  3. load and readRDS from base, and
  4. read_feather from feather and fread from data.table.

Data

First, let’s generate some random data:

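set.seed(123)
# 10 integer columns and 10 string columns, 1.5 million rows each;
# the 1000 random strings per column are recycled to fill all rows
df <- data.frame(replicate(10, sample(0:2000, 15 * 10^5, replace = TRUE)),
                 replicate(10, stringi::stri_rand_strings(1000, 5)))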

and save it to disk in each format to evaluate the load time. Besides the csv format, we will also need feather, RDS and RData files.

path_csv <- '../assets/data/fast_load/df.csv'
path_feather <- '../assets/data/fast_load/df.feather'
path_rdata <- '../assets/data/fast_load/df.RData'
path_rds <- '../assets/data/fast_load/df.rds'
library(feather)
library(data.table)
# write the same data frame in all four formats
write.csv(df, file = path_csv, row.names = FALSE)
write_feather(df, path_feather)
save(df, file = path_rdata)
saveRDS(df, path_rds)

Next, let’s check our files’ sizes:

files <- c(path_csv, path_feather, path_rdata, path_rds)
info <- file.info(files)
# convert bytes to megabytes
info$size_mb <- info$size / (1024 * 1024)
print(subset(info, select = c("size_mb")))
##                                       size_mb
## ../assets/data/fast_load/df.csv     1780.3005
## ../assets/data/fast_load/df.feather 1145.2881
## ../assets/data/fast_load/df.RData    285.4836
## ../assets/data/fast_load/df.rds      285.4837

As we can see, both the csv and feather files take up much more storage space. The csv file is about 6 times larger and the feather file about 4 times larger than the RDS and RData files.
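
The ratios can be checked with a few lines of base R. This is only a sketch that reuses the sizes printed above; the names size_mb and size_ratio are introduced here for illustration.

```r
# Sizes in MB, copied from the file.info() output above
size_mb <- c(csv = 1780.3005, feather = 1145.2881,
             RData = 285.4836, rds = 285.4837)

# Size of each file relative to the rds file
size_ratio <- size_mb / size_mb[["rds"]]
print(round(size_ratio, 1))
```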

If you are looking to learn more about importing data into R, this DataCamp tutorial covers everything from importing simple text files to more advanced SPSS and SAS files.

Benchmark

We will use the microbenchmark library to compare the read times of the following methods:

  • utils::read.csv
  • readr::read_csv
  • data.table::fread
  • base::load
  • base::readRDS
  • feather::read_feather

in 10 rounds.

library(microbenchmark)
# time each reader 10 times on the files written earlier
benchmark <- microbenchmark(readCSV = utils::read.csv(path_csv),
                            readrCSV = readr::read_csv(path_csv, progress = FALSE),
                            fread = data.table::fread(path_csv, showProgress = FALSE),
                            loadRdata = base::load(path_rdata),
                            readRds = base::readRDS(path_rds),
                            readFeather = feather::read_feather(path_feather),
                            times = 10)
print(benchmark, signif = 2)
##Unit: seconds
##        expr   min    lq       mean median    uq   max neval
##     readCSV 200.0 200.0 211.187125  210.0 220.0 240.0    10
##    readrCSV  27.0  28.0  29.770890   29.0  32.0  33.0    10
##       fread  15.0  16.0  17.250016   17.0  17.0  22.0    10
##   loadRdata   4.4   4.7   5.018918    4.8   5.5   5.9    10
##     readRds   4.6   4.7   5.053674    5.1   5.3   5.6    10
## readFeather   1.5   1.8   2.988021    3.4   3.6   4.1    10

And the winner is… feather! However, using feather requires converting the file to the feather format first.
Using load or readRDS also improves performance (second and third place in terms of speed) and has the added benefit of storing a smaller, compressed file. In both cases you will first have to convert your file to the proper format.
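
One practical difference between the two base functions is how the object comes back. The sketch below uses a tiny stand-in data frame and temporary files (small_df, rdata_file and rds_file are hypothetical names, not part of the benchmark) to show it:

```r
# A small stand-in data frame (hypothetical, not the benchmark data)
small_df <- data.frame(id = 1:3, label = c("a", "b", "c"))

rdata_file <- tempfile(fileext = ".RData")
rds_file <- tempfile(fileext = ".rds")
save(small_df, file = rdata_file)
saveRDS(small_df, rds_file)

# load() restores the object under its original name and
# invisibly returns the names of the restored objects
rm(small_df)
restored_names <- load(rdata_file)

# readRDS() returns the object itself, so it can be bound to any name
from_rds <- readRDS(rds_file)

identical(small_df, from_rds)
```

With load the variable name is fixed when the file is saved, while readRDS lets the calling code choose it.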

When it comes to reading from the csv format, fread clearly beats read_csv and read.csv (a median of 17 seconds against 29 and 210 seconds respectively), and thus is the best option for reading a csv file.
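
To put the gaps in relative terms, the medians from the benchmark output can be turned into ratios. This is a rough sketch on the rounded medians printed above; median_s and slowdown are names introduced here.

```r
# Median read times in seconds, copied from the benchmark output above
median_s <- c(readCSV = 210, readrCSV = 29, fread = 17,
              loadRdata = 4.8, readRds = 5.1, readFeather = 3.4)

# How many times slower each method is than the fastest one
slowdown <- median_s / min(median_s)
print(round(sort(slowdown), 1))

# The other two csv readers relative to fread
csv_vs_fread <- median_s[c("readCSV", "readrCSV")] / median_s[["fread"]]
print(round(csv_vs_fread, 1))
```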

We decided to go with the feather file, since converting from csv to this format is a one-time job and we had no strict storage limit that would have made us consider the Rds or RData format instead.

The final workflow was:

  1. reading a csv file provided by our customer using fread,
  2. writing it to feather using write_feather, and
  3. loading a feather file on app initialization using read_feather.

The first two tasks were done once and outside of a Shiny App context.
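
The workflow above can be sketched as follows. The file paths and the tiny csv are stand-ins created here so the example runs on its own; in the real app the csv came from the customer, and where the read_feather call lives (for example app.R or global.R) is an assumption.

```r
library(data.table)
library(feather)

# Hypothetical stand-in for the customer's csv file
csv_file <- tempfile(fileext = ".csv")
feather_file <- tempfile(fileext = ".feather")
write.csv(data.frame(x = 1:5, y = letters[1:5]), csv_file, row.names = FALSE)

# Steps 1 and 2: one-off conversion, run outside the Shiny app
customer_data <- fread(csv_file, showProgress = FALSE)
write_feather(customer_data, feather_file)

# Step 3: on app initialization, read the feather file
app_data <- read_feather(feather_file)
```

Only the last call needs to run when the app starts, so the slower csv parsing never affects the initialization time.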

There is also quite an interesting benchmark done by Hadley here on reading complete files into R. Unfortunately, the functions defined in that post return a character object, so you will have to apply string manipulation to obtain a commonly used data frame.


Olga Mierzwa-Sulima



Appsilon Data Science Blog
