May 15, 2015

Why?

Embarrassingly parallel tasks

parallel processes:

  • Bootstrapping
  • Cross-validation
  • Simulating independent random variables (see the doRNG package)

non-parallel processes (each step depends on the previous one):

  • MCMC algorithms
  • Several types of model selection (e.g., step() or the LARS algorithm for LASSO)

What to do

Options

  • Changing from a for loop to one of the apply() functions can help, but still doesn't use multiple processors.
  • Use the parallel package (thanks, Miranda!).
  • Don't use R.
  • Use the foreach package! (Revolution Analytics and Weston 2014)
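
To make the first two options concrete, here is a minimal sketch that runs the same independent task three ways: a for loop, sapply(), and the parallel package. The function slow_square() and the choice of 2 workers are hypothetical stand-ins, not part of the original analysis.

```r
library(parallel)  # ships with base R

# Hypothetical stand-in for an expensive task whose runs are independent.
slow_square <- function(i) {
  Sys.sleep(0.01)
  i^2
}

# 1. Plain for loop: one iteration after another, one process.
res_for <- numeric(8)
for (i in 1:8) res_for[i] <- slow_square(i)

# 2. apply family: tidier code, but still a single process.
res_apply <- sapply(1:8, slow_square)

# 3. parallel package: spread the iterations over 2 worker processes.
cl <- makeCluster(2)
res_par <- parSapply(cl, 1:8, slow_square)
stopCluster(cl)
```

All three give the same result; only the third can use more than one processor. Starting the workers has its own cost, so the parallel version pays off only when each task is slow enough.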

Why foreach?

  • Make use of our whole computer
  • Without having to invest large amounts of time in learning new programming languages
  • Our goal: transform a for loop into a foreach loop
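
As a rough sketch of that goal, the same small computation is written below as a for loop and as a foreach loop. The task (means of 1:i) is a made-up toy, and %do% is used so that the example runs sequentially without registering a parallel backend.

```r
library(foreach)

# Ordinary for loop: pre-allocate a result, then fill it in by index.
means_for <- numeric(4)
for (i in 1:4) {
  means_for[i] <- mean(seq_len(i))
}

# foreach version: the loop body returns a value, and .combine = c
# collects the returned values into a vector.
means_foreach <- foreach(i = 1:4, .combine = c) %do% {
  mean(seq_len(i))
}
```

The key change is that a foreach body returns its result instead of assigning into an outside object. Once a parallel backend is registered (for example with the doParallel package), replacing %do% with %dopar% should run the iterations in parallel.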

Example: data and research question

Citi Bike NYC

Goal: predict arrival volume to inform management of bike stations

  • 7 busiest locations from May 2014
  • response: number of arrivals each hour of every day in the month
  • covariates: hour of the day and whether the day is a weekend
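
One possible way to build this response and these covariates from raw trip records is sketched below. The timestamps are invented for illustration (they are not real Citi Bike data), and the column names are assumptions.

```r
# Hypothetical arrival timestamps for one station (not real data).
arrivals <- as.POSIXct(
  c("2014-05-02 08:15:00", "2014-05-02 08:40:00", "2014-05-03 14:05:00",
    "2014-05-04 14:30:00", "2014-05-05 08:20:00"),
  tz = "UTC"
)

# Covariates: hour of the day and a weekend indicator.
lt <- as.POSIXlt(arrivals, tz = "UTC")
trips <- data.frame(
  date    = as.Date(arrivals),
  hour    = lt$hour,
  weekend = lt$wday %in% c(0, 6)  # wday: 0 = Sunday, 6 = Saturday
)

# Response: number of arrivals in each (date, hour) cell.
hourly <- aggregate(
  list(arrivals = rep(1L, nrow(trips))),
  by = trips[c("date", "hour", "weekend")],
  FUN = sum
)
```

Note that aggregate() only returns hours with at least one arrival; hours with zero arrivals would have to be filled in separately before modeling.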