Big data

Large datasets (typically larger than 1 TB) are becoming increasingly common, and substantial resources have been invested in developing technologies capable of collecting, storing, and analyzing them. Typically, the choice of framework is dictated by how the data was stored in the first place.

Even when the complete dataset doesn't fit on a single machine, it is often possible to devise strategies that extract the answers without probing the whole dataset. For example, many questions can be answered by extracting a small, interesting subset of the data that can easily be loaded in memory and analyzed with convenient, performant libraries such as Pandas. By filtering or randomly sampling data points, one can often find a good enough answer to a business question without resorting to big data tools.
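
A minimal sketch of this approach with Pandas, assuming a hypothetical transactions.csv file with country and amount columns, might look like the following:

    import pandas as pd

    # Stream the large CSV in chunks so that only a filtered subset
    # is ever held in memory (file and column names are hypothetical).
    chunks = pd.read_csv("transactions.csv", chunksize=1_000_000)
    subset = pd.concat(
        chunk[chunk["country"] == "IT"] for chunk in chunks
    )

    # A random sample of the subset is often enough to answer a
    # business question approximately.
    sample = subset.sample(frac=0.01, random_state=42)
    print(sample["amount"].mean())

Once the subset fits in memory, the full Pandas API is available for the rest of the analysis.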

If the bulk of the company's software is written in Python, and you have the freedom to choose your software stack, it makes sense to use Dask with its distributed scheduler. The package is very simple to set up and is tightly integrated with the Python ecosystem. With the Dask array and DataFrame collections, it is easy to scale existing algorithms by adapting your NumPy and Pandas code.
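
As a rough sketch, the same kind of filtering can be expressed with a Dask DataFrame so that it runs out of core or, with the distributed scheduler, across a cluster; the file pattern and column names are again hypothetical:

    import dask.dataframe as dd

    # Read a collection of CSV files lazily; no data is loaded yet.
    df = dd.read_csv("transactions-*.csv")
    italian = df[df["country"] == "IT"]

    # Operations are lazy; calling .compute() triggers the actual
    # work and returns an in-memory result.
    print(italian["amount"].mean().compute())

The code mirrors the Pandas version almost line for line, which is what makes porting an existing codebase to Dask relatively painless.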

Quite often, a company may already have a Spark cluster set up. In that case, PySpark is the best choice, and the use of Spark SQL is encouraged for higher performance. An added advantage of Spark is that it allows the use of other languages, such as Scala and Java.
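
A minimal PySpark sketch of the same query, expressed through Spark SQL (the file name, view name, and columns are hypothetical), could look like this:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").getOrCreate()

    # Load the data and register it as a temporary view so that it
    # can be queried with SQL.
    df = spark.read.csv("transactions.csv", header=True, inferSchema=True)
    df.createOrReplaceTempView("transactions")

    result = spark.sql(
        "SELECT AVG(amount) AS avg_amount "
        "FROM transactions WHERE country = 'IT'"
    )
    result.show()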
