Sometimes you open a big Dataset with Python’s Pandas, try to get a few metrics, and the whole thing just freezes horribly, and you need to close the notebook and restart the machine.
If you work on truly Big Data, you probably won’t use Pandas at all: you can’t afford to wait that long for the data to process, so you’re better off using Spark or something similar.
I found out about this amazing library while searching the internet for ways to parallelize data processing and model training. If you’re a Python person, it’s a way to speed up Data Analysis in Python without having to get better infrastructure or switch languages. It will eventually feel limited if your Dataset is huge, but it scales a lot better than regular Pandas, and may be just the fit for your problem.
What is Dask?
Dask is an Open Source project that gives you abstractions over NumPy Arrays, Pandas Dataframes and regular lists, allowing you to run operations on them in parallel, using multicore processing.
Here’s an excerpt straight from the tutorial:
Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don’t fit into main memory. Dask’s high-level collections are alternatives to NumPy and Pandas for large datasets.
It’s as awesome as it sounds! I set out to try Dask Dataframes out for this article, and ran a couple of benchmarks on them.
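To give a feel for the API before we get there, here is a minimal sketch (the column names and partition count are made up for illustration). Operations look just like Pandas, but they stay lazy until you call `.compute()`:

```python
import pandas as pd
import dask.dataframe as dd

# Wrap a regular Pandas Dataframe; npartitions controls how many
# chunks Dask splits it into, each processed in parallel.
pdf = pd.DataFrame({"x": range(1_000_000), "y": [1, 2] * 500_000})
ddf = dd.from_pandas(pdf, npartitions=8)

# This only builds a lazy task graph; nothing runs yet.
total = ddf.x.sum()

# .compute() executes the graph across your cores and returns a plain result.
print(total.compute())
```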
It can even help you train a model much faster, by utilizing all the workers on your machine.
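One common pattern for this (not covered in this article’s benchmarks, so treat it as a hedged sketch) is routing scikit-learn’s internal joblib parallelism to Dask workers:

```python
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # local scheduler; one worker per core by default

# Toy data, purely for illustration
X, y = make_classification(n_samples=10_000, n_features=20)

model = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# scikit-learn's internal joblib calls are dispatched to the Dask workers
with parallel_backend("dask"):
    model.fit(X, y)
```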
First and foremost, read the official documentation.
Reading the docs
What I did first was read the official documentation, to see what exactly was recommended to do with Dask Dataframes instead of regular Dataframes. Here are the relevant parts from the official docs:
- Manipulating large datasets, even when those datasets don’t fit in memory
- Accelerating long computations by using many cores
- Distributed computing on large datasets with standard Pandas operations like groupby, join, and time series computations
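As a sketch of what that last point looks like in practice (the file name and columns here are hypothetical), a groupby reads exactly like its Pandas equivalent:

```python
import dask.dataframe as dd

# Reads the CSVs in chunks; the data can be bigger than RAM.
df = dd.read_csv("transactions-*.csv")

# Standard Pandas-style groupby, executed in parallel per partition
mean_amount = df.groupby("user_id")["amount"].mean()

print(mean_amount.compute())
```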
And then below that, it lists some of the things that are really fast if you use Dask Dataframes:
- Arithmetic operations (multiplying or adding to a Series)
- Common aggregations (mean, min, max, sum, etc.)
- Calling value_counts(), drop_duplicates() or corr()
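For instance, all of the following stay lazy and run partition-wise in parallel (again, the file and column names are just placeholders):

```python
import dask.dataframe as dd

df = dd.read_csv("data-*.csv")

doubled = df["amount"] * 2              # arithmetic on a Series
mean_amount = df["amount"].mean()       # common aggregation
counts = df["category"].value_counts()
unique_rows = df.drop_duplicates()

# Only now does any work actually happen
print(counts.compute())
```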