Learning Dask With Python Distributed Computing¶
Authors¶
- Soumil Nitin Shah
Soumil Nitin Shah¶
Bachelor in Electronic Engineering | Masters in Electrical Engineering | Master in Computer Engineering |
- Website : https://soumilshah.herokuapp.com
- Github: https://github.com/soumilshah1995
- Linkedin: https://www.linkedin.com/in/shah-soumil/
- Blog: https://soumilshah1995.blogspot.com/
- Youtube : https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw?view_as=subscriber
- Facebook Page : https://www.facebook.com/soumilshah1995/
- Email : shahsoumil519@gmail.com
- projects : https://soumilshah.herokuapp.com/project
Hello! I’m Soumil Nitin Shah, a Software and Hardware Developer based in New York City. I have completed by Bachelor in Electronic Engineering and my Double master’s in Computer and Electrical Engineering. I Develop Python Based Cross Platform Desktop Application , Webpages , Software, REST API, Database and much more I have more than 2 Years of Experience in Python
Step 1: Installation of Dask¶
- !pip install dask
- !pip install cloudpickle
- !pip install "dask[dataframe]"
- !pip install "dask[complete]"
!pip show dask
# Define the Imports
try:
import os
import json
import math
import dask
from dask.distributed import Client
import dask.dataframe as dd
import numpy as np
import dask.multiprocessing
except Exception as e:
print("Some Modules are Missing : {} ".format(e))
os.listdir()
size = os.path.getsize("netflix_titles.csv") / math.pow(1024,3)
print("Size in GB : {} ".format(size))
client = Client(n_workers=3, threads_per_worker=1, processes=False, memory_limit='2GB')
client
Read a csv Files¶
dd = dd.read_csv("netflix_titles.csv")
dd
dd.head(1)
dd.tail(1)
dd.shape
Selecting colums in Dask¶
dd.columns
dd["show_id"].head(1)
dd.show_id.head(1)
dd[["show_id", "title"]].head(3)
dd['type'][dd["show_id"] == 70234439 ].head(1)
Apply Function in dask¶
def toupper(x):
return x.upper()
dd.title = dd["title"].map(toupper)
dd.title.head(2)
Apply across cluster¶
tem = [ result.result() for result in client.map(toupper , dd["title"]) ]
import dask.array as da
permutations = da.from_array(np.asarray(tem))
dd['random'] = permutations
dd.head(1)
Basics of Dask¶
from time import sleep
def inc(x):
sleep(1)
return x + 1
def add(x, y):
sleep(1)
%%time
# This takes three seconds to run because we call each
# function sequentially, one after the other
x = inc(1)
y = inc(2)
z = add(x, y)
Parallelize with the dask.delayed decorator¶
# This runs immediately, all it does is build a graph
from dask import delayed
x = delayed(inc)(1)
y = delayed(inc)(2)
z = delayed(add)(x, y)
%%time
# This actually runs our computation using a local thread pool
z.compute()
Exercise: Parallelize a for loop¶¶
from time import sleep
def inc(x):
sleep(1)
return x + 1
def add(x, y):
sleep(1)
data = [1, 2, 3, 4, 5, 6, 7, 8]
%%time
# Sequential code
results = []
for x in data:
y = inc(x)
results.append(y)
total = sum(results)
%%time
results = []
for x in data:
y = delayed(inc)(x)
results.append(y)
total = delayed(sum)(results)
print("Before computing:", total) # Let's see what type of thing total is
result = total.compute()
print("After computing :", result) # After it's computed
if you want to use decorators¶
from time import sleep
@delayed
def inc(x):
sleep(1)
return x + 1
@delayed
def add(x, y):
sleep(1)
%%time
# this looks like ordinary code
x = inc(15)
y = inc(30)
total = add(x, y)
total.compute()