Learning Dask With Python Distributed Computing¶

Authors¶

Soumil Nitin Shah

Bachelor's in Electronic Engineering | Master's in Electrical Engineering | Master's in Computer Engineering

Website: https://soumilshah.herokuapp.com
GitHub: https://github.com/soumilshah1995
LinkedIn: https://www.linkedin.com/in/shah-soumil/
Blog: https://soumilshah1995.blogspot.com/
YouTube: https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw?view_as=subscriber
Facebook Page: https://www.facebook.com/soumilshah1995/
Email: shahsoumil519@gmail.com
Projects: https://soumilshah.herokuapp.com/project

Hello! I’m Soumil Nitin Shah, a Software and Hardware Developer based in New York City. I have completed my Bachelor's in Electronic Engineering and a double Master's in Computer and Electrical Engineering. I develop Python-based cross-platform desktop applications, web pages, software, REST APIs, databases, and much more. I have more than 2 years of experience in Python.

Step 1: Installation of Dask¶

!pip install dask
!pip install cloudpickle
!pip install "dask[dataframe]"
!pip install "dask[complete]"

!pip show dask

Name: dask
Version: 2020.12.0
Summary: Parallel PyData with Task Scheduling
Home-page: https://github.com/dask/dask/
Author: None
Author-email: None
License: BSD
Location: c:\python38\lib\site-packages
Requires: pyyaml
Required-by: swifter, distributed

# Define the imports
try:
    import os
    import json
    import math
    import dask
    from dask.distributed import Client
    import dask.dataframe as dd
    import numpy as np
    import dask.multiprocessing
except ImportError as e:
    # Report which package is missing instead of hiding unrelated errors
    print("Some Modules are Missing : {} ".format(e))

os.listdir()

['.ipynb_checkpoints',
 'dask-worker-space',
 'netflix_titles.csv',
 'Stackoverflow-issue-',
 'Untitled.ipynb',
 'Youtube.ipynb']

size = os.path.getsize("netflix_titles.csv") / math.pow(1024,3)
print("Size in GB : {} ".format(size))

Size in GB : 0.0022451020777225494
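A value like 0.0022 GB is hard to read; it is roughly 2.3 MB. Below is a minimal sketch of a helper that picks a more readable unit. The name human_size is hypothetical (it is not part of os or Dask), and the demo writes a small temporary file so it does not depend on netflix_titles.csv.

```python
import os
import tempfile


def human_size(num_bytes):
    """Format a byte count using the largest unit that keeps the value below 1024."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TB"


# Demo on a throwaway 2048-byte file (hypothetical data, not the Netflix CSV).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 2048)

demo_size = human_size(os.path.getsize(tmp.name))
os.remove(tmp.name)
print("Size :", demo_size)
```

Calling human_size(os.path.getsize("netflix_titles.csv")) would apply the same formatting to the file used in this notebook.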

client = Client(n_workers=3, threads_per_worker=1, processes=False, memory_limit='2GB')

client
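To check that the client really hands work to its workers, you can submit a single function call and wait for the result. This is only a sketch under a few assumptions: the function square is made up for illustration, the worker count is smaller than above, and dashboard_address=None is used here to switch off the diagnostics dashboard so the snippet does not need a free port.

```python
from dask.distributed import Client


def square(x):
    # Hypothetical example function, not part of the original notebook
    return x * x


# processes=False keeps the workers as threads inside this Python process.
# The "with" block shuts the client and its local cluster down at the end.
with Client(n_workers=2, threads_per_worker=1, processes=False,
            dashboard_address=None) as demo_client:
    future = demo_client.submit(square, 7)  # returns a Future right away
    answer = future.result()                # blocks until a worker is done

print("Result :", answer)
```

In the notebook the client is left open instead, because the later cells reuse it.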

Read a CSV File¶

# Note: this rebinds the name dd from the dask.dataframe module to the DataFrame itself
dd = dd.read_csv("netflix_titles.csv")

dd

dd.head(1)

dd.tail(1)

dd.shape

(Delayed('int-f166895e-231a-4261-b2de-5fd6325e3522'), 12)

Selecting columns in Dask¶

dd.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

dd["show_id"].head(1)

0    81145628
Name: show_id, dtype: int64

dd.show_id.head(1)

0    81145628
Name: show_id, dtype: int64

dd[["show_id", "title"]].head(3)

dd['type'][dd["show_id"] == 70234439 ].head(1)

2    TV Show
Name: type, dtype: object

Apply a Function in Dask¶

def toupper(x):
    return x.upper()

dd.title = dd["title"].map(toupper)

dd.title.head(2)

0    NORM OF THE NORTH: KING SIZED ADVENTURE
1                 JANDINO: WHATEVER IT TAKES
Name: title, dtype: object

Apply across cluster¶

tem = [ result.result()  for result in client.map(toupper , dd["title"]) ]

import dask.array as da
permutations = da.from_array(np.asarray(tem))

dd['random'] = permutations

dd.head(1)

Basics of Dask¶

from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

%%time
# This takes three seconds to run because we call each
# function sequentially, one after the other

x = inc(1)
y = inc(2)
z = add(x, y)

Wall time: 3 s

Parallelize with the dask.delayed decorator¶

# This runs immediately, all it does is build a graph
from dask import delayed

x = delayed(inc)(1)
y = delayed(inc)(2)
z = delayed(add)(x, y)

%%time
# This actually runs the computation: on the Client created above if it is
# still active, otherwise on Dask's default local thread pool

z.compute()

Wall time: 2.02 s
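The time drops from about three seconds to about two because the two inc calls do not depend on each other and can run at the same time; only add has to wait for both. The sketch below repeats the idea with shorter sleeps and prints the result as well as the timing. Passing scheduler="threads" is a choice made for this example so that it runs without a Client.

```python
from time import perf_counter, sleep

from dask import delayed


def inc(x):
    sleep(0.2)
    return x + 1


def add(x, y):
    sleep(0.2)
    return x + y


# Build the graph: two independent inc tasks feeding one add task
x = delayed(inc)(1)
y = delayed(inc)(2)
z = delayed(add)(x, y)

start = perf_counter()
value = z.compute(scheduler="threads")  # local thread pool, no Client needed
elapsed = perf_counter() - start

print("Result :", value)
print("Seconds:", round(elapsed, 2))
```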

Exercise: Parallelize a for loop¶

from time import sleep

def inc(x):
    sleep(1)
    return x + 1

def add(x, y):
    sleep(1)
    return x + y

data = [1, 2, 3, 4, 5, 6, 7, 8]

%%time

# Sequential code
results = []
for x in data:
    y = inc(x)
    results.append(y)
    
total = sum(results)

Wall time: 8.01 s

%%time

results = []

for x in data:
    y = delayed(inc)(x)
    results.append(y)
    
total = delayed(sum)(results)
print("Before computing:", total)  # Let's see what type of thing total is
result = total.compute()
print("After computing :", result)  # After it's computed

Before computing: Delayed('sum-444b98df-760d-4339-bf76-10160ed74812')
After computing : 44
Wall time: 3.25 s
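A wall time of about three seconds fits eight one-second tasks shared between the three single-threaded workers created earlier. If you want the individual results and not just their sum, dask.compute can evaluate a whole list of delayed objects in one call. A minimal sketch, without the sleep so it runs instantly and with scheduler="threads" chosen so it works without a Client:

```python
import dask
from dask import delayed


def inc(x):
    return x + 1


data = [1, 2, 3, 4, 5, 6, 7, 8]

# One delayed task per item; nothing has run yet
lazy_results = [delayed(inc)(x) for x in data]

# dask.compute evaluates all of them together and returns a tuple
values = dask.compute(*lazy_results, scheduler="threads")
total = sum(values)

print("Values:", values)
print("Total :", total)
```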

If you want to use decorators¶

from time import sleep

@delayed
def inc(x):
    sleep(1)
    return x + 1

@delayed
def add(x, y):
    sleep(1)
    return x + y

%%time

# this looks like ordinary code

x = inc(15)
y = inc(30)

total = add(x, y)
total.compute()

Wall time: 2.11 s

Pythonist

Saturday, January 30, 2021

Learn about Python Dask in an Easy Way