dplyr versus data.table

You might now be wondering, "which package should we use?"

The dplyr and data.table packages offer strikingly different syntax, while the difference in performance is less decisive. Although data.table seems to be slightly faster on larger datasets, there is no clear winner, except when doing aggregations on a high number of groups, where data.table has the edge. And to be honest, the pipe syntax of dplyr, which is provided by the magrittr package, can also be used with data.table objects if needed.
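As a minimal sketch of that last point, the following example chains two data.table queries with the magrittr pipe. The toy dataset and its column names (group, value) are made up for illustration; the dot is magrittr's placeholder for the object coming from the left-hand side:

```r
library(data.table)
library(magrittr)

# Toy data; the column names are hypothetical
dt <- data.table(group = c("a", "a", "b", "b", "b"),
                 value = c(1, 2, 3, 4, 5))

# Each step is an ordinary data.table query; the dot stands for
# the result of the previous step
res <- dt %>%
    .[value > 1] %>%
    .[, .(total = sum(value)), by = group]

print(res)
```

The same result can be computed without any pipe as dt[value > 1, .(total = sum(value)), by = group], so the piped form is a matter of taste rather than a necessity.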

There is also another R package that provides pipes, called pipeR, which claims to be a lot more efficient on larger datasets than magrittr. This performance gain is due to the fact that the pipeR operators do not try to be as smart as the magrittr operator, which is modeled on the |> operator of the F# language. The overhead of the magrittr pipe is sometimes estimated to be 5-15 times the running time of the same code written without any pipes.

One should take into account the community and support behind an R package before spending a considerable amount of time learning how to use it. In a nutshell, the data.table package is, without doubt, mature enough for production usage, as its development was started around 6 years ago by Matt Dowle, who was working for a large hedge fund at that time. The development has been continuous since then. Matt and Arun (co-developer of the package) release new features and performance tweaks from time to time, and they both seem to be keen on providing support on the public R forums and channels, such as mailing lists and StackOverflow.

On the other hand, dplyr is shipped by Hadley Wickham and RStudio, one of the best-known developers and one of the trending companies in the R community, which translates to an even larger user base and community, and almost instant support on StackOverflow and GitHub.

In short, I suggest using the package that fits your needs best, after dedicating some time to discovering the power and features each makes available. If you are coming from an SQL background, you'll probably find data.table a lot more convenient, while others would rather opt for the Hadleyverse (take a look at the R package with this name; it installs a bunch of useful R packages developed by Hadley). You should not mix the two approaches in a single project: for both readability and performance reasons, it's better to stick to only one syntax at a time.
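To make the difference in syntax more tangible, here is a small sketch of one aggregation written both ways. The data and the column names are invented for this illustration. The two versions appear side by side only for comparison, which is not how you would organize a real project:

```r
library(data.table)
library(dplyr)

# Toy data; the column names are hypothetical
df <- data.frame(group = c("a", "a", "b", "b", "b"),
                 value = c(1, 2, 3, 4, 5),
                 stringsAsFactors = FALSE)

# data.table: DT[i, j, by] reads roughly like WHERE, SELECT, GROUP BY
dt <- as.data.table(df)
res_dt <- dt[value > 1, .(total = sum(value)), by = group]

# dplyr: one verb per step, chained with the pipe
res_dplyr <- df %>%
    filter(value > 1) %>%
    group_by(group) %>%
    summarise(total = sum(value))

print(res_dt)
print(res_dplyr)
```

Both calls return one row per group with the sum of the values greater than 1. The data.table version packs the filter, the computation, and the grouping into a single bracket, while dplyr spells out each step as a separate function call.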

To get a deeper understanding of the pros and cons of the different approaches, I will continue to provide multiple implementations of the same problem in the following few pages as well.
