r - Multi windows range calculations data.table vs dplyr -
i'm doing range calculations (i.e. max , min) on multiple windows on stocks returns. have version in dplyr, many people publishing benchmarking calculations data.table faster. i've created version data.table syntax, it's slower dplyr one. me find better way use data.table make faster? many thanks.
library(quandl) library(tidyr) library(dplyr) library(data.table) library(microbenchmark) tickers <- c("goog/nasdaq_aapl", "goog/nasdaq_msft", "goog/nyse_ibm", "goog/nasdaq_goog") data <- quandl(tickers,transformation = "rdiff") returns <- gather(data, stock, value, -date) %>% separate(stock, c("name", "field"), " - ") %>% filter( field == "close" ) %>% select( - field ) returns_dt <- data.table(returns) multi_window_range <- function(data) { result_1y <- data %>% filter( date >= sys.date() - 365 ) %>% group_by(name) %>% summarise( max_1y = max(value, na.rm = true), min_1y = min(value, na.rm = true) ) result_2y <- data %>% filter( date >= sys.date() - 365 * 2 ) %>% group_by(name) %>% summarise( max_2y = max(value, na.rm = true), min_2y = min(value, na.rm = true) ) result_5y <- data %>% filter( date >= sys.date() - 365 * 5 ) %>% group_by(name) %>% summarise( max_5y = max(value, na.rm = true), min_5y = min(value, na.rm = true) ) return(inner_join(inner_join(result_1y, result_2y, = "name"), result_5y, = "name")) } multi_window_range_dt <- function(data) { setkey(data, name) result_1y <- data[date >= sys.date() - 365, list( max_1y = max(value, na.rm = true), min_1y = min(value, na.rm = true) ), = "name"] result_2y <- data[date >= sys.date() - 365 * 2, list( max_2y = max(value, na.rm = true), min_2y = min(value, na.rm = true) ), = "name"] result_5y <- data[date >= sys.date() - 365 * 5, list( max_5y = max(value, na.rm = true), min_5y = min(value, na.rm = true) ), = "name"] return(result_1y[result_2y][result_5y]) } microbenchmark( multi_window_range(returns), multi_window_range_dt(returns_dt) ) unit: milliseconds expr min lq mean median uq max neval multi_window_range(returns) 6.341532 6.522303 6.915266 6.692666 6.922623 10.16709 100 multi_window_range_dt(returns_dt) 7.537073 7.738516 8.066579 7.865968 8.073114 12.68021 100
try this:
multi_window_range_dt2 <- function(data) { data[, { rng1 <- range(value[date > sys.date() - 365], na.rm = true) rng2 <- range(value[date > sys.date() - 2*365], na.rm = true) rng5 <- range(value[date > sys.date() - 5*365], na.rm = true) list(max_1y = rng1[2], min_1y = rng1[1], max_2y = rng2[2], min_2y = rng2[1], max_5y = rng5[2], min_5y = rng5[1]) }, = "name"] } library(rbenchmark) benchmark(multi_window_range(returns), multi_window_range_dt2(returns_dt))[1:4]
which gives on laptop:
test replications elapsed relative 1 multi_window_range(returns) 100 2.39 1.189 2 multi_window_range_dt2(returns_dt) 100 2.01 1.000
this indicates multi_window_range
takes 18.9% more time multi_window_range_dt2
:
Comments
Post a Comment