Parallelize your ML Model Training using Dask

Sometimes you open a big dataset with Python's Pandas, try to compute a few metrics, and the whole thing freezes so badly that you need to close the notebook and restart the machine.
If you work with Big Data, you usually can't use Pandas at all: you can't afford to wait that long for the data to process, so you're better off reaching for Spark or a similar tool.

I found out about this amazing library while searching the internet for ways to parallelize data processing and model training. If you're a Python person, it offers a way to speed up data analysis without having to buy better infrastructure or switch languages. It will eventually feel limited if your dataset is truly huge, but it scales a lot better than regular Pandas and may be just the fit for your problem.


What is Dask?

Dask is an Open Source project that gives you abstractions over NumPy Arrays, Pandas Dataframes and regular lists, allowing you to run operations on them in parallel, using multicore processing.

Here’s an excerpt straight from the tutorial:

Dask provides high-level Array, Bag, and DataFrame collections that mimic NumPy, lists, and Pandas but can operate in parallel on datasets that don’t fit into main memory. Dask’s high-level collections are alternatives to NumPy and Pandas for large datasets.

It’s as awesome as it sounds! I set out to try Dask Dataframes for this article and ran a couple of benchmarks on them.

It even helps you train a model quickly by utilizing all the cores of your machine.

First and foremost, read the official documentation.


Reading the docs

What I did first was read the official documentation, to see exactly what is recommended to do with Dask Dataframes instead of regular ones. Here are the relevant parts from the official docs:

  • Manipulating large datasets, even when those datasets don’t fit in memory
  • Accelerating long computations by using many cores
  • Distributed computing on large datasets with standard Pandas operations like groupby, join, and time series computations

And then below that, it lists some of the things that are really fast if you use Dask Dataframes:

  • Arithmetic operations (multiplying or adding to a Series)
  • Common aggregations (mean, min, max, sum, etc.)
  • Calling value_counts(), drop_duplicates() or corr()

I have demonstrated how to use Dask to train your ML model in parallel; have a look at my GitHub notebook to learn more.
Link to notebook: GITHUB

Note that Dask is not used to train the ML model itself; it helps you load a dataset into memory as a Pandas-like dataframe and parallelize the surrounding work.
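One common pattern (a sketch of my own, not the contents of the notebook) is to let scikit-learn do the actual fitting while `dask.delayed` runs several independent fits in parallel, for example one per hyperparameter value:

```python
import dask
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

@dask.delayed
def fit_and_score(C):
    # The model training itself is plain scikit-learn; Dask only
    # schedules these independent fits to run concurrently.
    model = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    return C, model.score(X_te, y_te)

# Build a lazy task graph, then execute all fits in parallel.
tasks = [fit_and_score(C) for C in (0.01, 0.1, 1.0, 10.0)]
scores = dask.compute(*tasks)
print(scores)
```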


Part 2 of this series will be released soon, where you will see a live dashboard while training your model or performing operations, showing how memory is shared across your hardware during computation.
