Parallelizing the pandas apply function
Soumil Nitin Shah
Bachelor in Electronic Engineering | Masters in Electrical Engineering | Master in Computer Engineering |
- Website : https://soumilshah.herokuapp.com
- Github: https://github.com/soumilshah1995
- Linkedin: https://www.linkedin.com/in/shah-soumil/
- Blog: https://soumilshah1995.blogspot.com/
- Youtube : https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw?view_as=subscriber
- Facebook Page : https://www.facebook.com/soumilshah1995/
- Email : shahsoumil519@gmail.com
We have a large pandas DataFrame and want to apply an expensive function to it, which takes a long time.
In [1]:
import os
import json
import pandas as pd
import tqdm
import datetime
import time
In [4]:
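# Load the whole file; passing chunksize would return an iterator, not a DataFrame
df = pd.read_csv("netflix_titles.csv")
# Drop rows with missing values so the string transforms below never see NaN
df = df.dropna()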
In [5]:
df.shape
Out[5]:
In [6]:
df.head(2)
Out[6]:
Say I need to apply several transformations to my data. Let's take a simple example.
In [11]:
def some_expensive_computation(x):
    # Stand-in for slow per-row work: wait 10 ms, then return the value unchanged
    time.sleep(0.01)
    return x

def apply_transform(df):
    # Lower-case three text columns, then run the slow function on each description
    df['title'] = df['title'].apply(lambda x: x.lower())
    df['director'] = df['director'].apply(lambda x: x.lower())
    df['cast'] = df['cast'].apply(lambda x: x.lower())
    df['description'] = df['description'].apply(some_expensive_computation)
    return df
In [12]:
start = datetime.datetime.now()
df1 = apply_transform(df)
end = datetime.datetime.now()
print("-"*44)
print("Execution Time: {} ".format(end-start))
print("-"*44)
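The start/end timing lines are repeated in every method below. As a minimal sketch, they could be wrapped in a small context manager; the name `timed` and its `label` parameter are hypothetical helpers introduced here, not part of the original notebook.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label="Execution Time"):
    """Print the wall-clock time spent inside the with-block.

    Hypothetical helper: yields a dict that receives the elapsed
    seconds under the key "seconds" once the block finishes.
    """
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print("-" * 44)
        print("{}: {:.3f} s".format(label, result["seconds"]))
        print("-" * 44)
```

Usage would look like `with timed(): df1 = apply_transform(df)`, which prints the same kind of banner as the cell above.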
In [13]:
from tqdm import tqdm
tqdm.pandas()
In [14]:
def some_expensive_computation(x):
    time.sleep(0.01)
    return x

def apply_transform(df):
    df['title'] = df['title'].progress_apply(lambda x: x.lower())
    df['director'] = df['director'].progress_apply(lambda x: x.lower())
    df['cast'] = df['cast'].progress_apply(lambda x: x.lower())
    df['description'] = df['description'].progress_apply(some_expensive_computation)
    return df
In [15]:
start = datetime.datetime.now()
df1 = apply_transform(df)
end = datetime.datetime.now()
print("-"*44)
print("Execution Time: {} ".format(end-start))
print("-"*44)
Now there is at least a progress bar. How can I speed this up, and what options are available?
Method 1: swifter
In [1]:
from multiprocessing import Pool
import numpy as np
import os
import json
import pandas as pd
import tqdm
import datetime
import time
import swifter
df = pd.read_csv("netflix_titles.csv")
df = df.dropna()
In [2]:
def some_expensive_computation(x):
    time.sleep(0.01)
    return x

def apply_transform(df):
    df['title'] = df['title'].swifter.apply(lambda x: x.lower())
    df['director'] = df['director'].swifter.apply(lambda x: x.lower())
    df['cast'] = df['cast'].swifter.apply(lambda x: x.lower())
    df['description'] = df['description'].swifter.apply(some_expensive_computation)
    return df
In [3]:
start = datetime.datetime.now()
df1 = apply_transform(df)
end = datetime.datetime.now()
print("-"*44)
print("Execution Time: {} ".format(end-start))
print("-"*44)
Method 2: mapply
- Try changing the number of workers; the timings may vary.
In [1]:
from multiprocessing import Pool
import numpy as np
import os
import json
import pandas as pd
import tqdm
import datetime
import time
import mapply
df = pd.read_csv("netflix_titles.csv")
df = df.dropna()
# n_workers sets the maximum number of worker processes; -1 uses all available cores
mapply.init(n_workers=1)
In [2]:
def some_expensive_computation(x):
    time.sleep(0.01)
    return x

def apply_transform(df):
    df['title'] = df['title'].mapply(lambda x: x.lower())
    df['director'] = df['director'].mapply(lambda x: x.lower())
    df['cast'] = df['cast'].mapply(lambda x: x.lower())
    df['description'] = df['description'].mapply(some_expensive_computation)
    return df
In [3]:
start = datetime.datetime.now()
df1 = apply_transform(df)
end = datetime.datetime.now()
print("-"*44)
print("Execution Time: {} ".format(end-start))
print("-"*44)
Method 3
Use Dask plugin
Method 4: multiprocessing Pool
In [1]:
from multiprocessing import Pool
import numpy as np
import os
import json
import pandas as pd
import tqdm
import datetime
import time
df = pd.read_csv("netflix_titles.csv")
df = df.dropna()
In [2]:
def apply_transform(df):
    # Plain apply: each worker process already handles its own chunk,
    # and .mapply is unavailable here because mapply.init() was never called
    df['title'] = df['title'].apply(lambda x: x.lower())
    df['director'] = df['director'].apply(lambda x: x.lower())
    df['cast'] = df['cast'].apply(lambda x: x.lower())
    return df
In [3]:
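def parallelize_dataframe(df, func, n_cores=4):
    # Split the frame into n_cores chunks, transform each chunk in its own
    # process, then concatenate the results back into one DataFrame
    df_split = np.array_split(df, n_cores)
    pool = Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df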
In [ ]:
df = parallelize_dataframe(df, apply_transform)
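The split, map, and concatenate pattern above can also be tried with a thread pool. This is a swapped-in variant, not the author's method: because the example's cost is `time.sleep`, which releases the GIL, threads are enough to show the speed-up, and they avoid the pickling restrictions that process pools can hit inside notebooks. The names `split_frame` and `parallelize_with_threads` are hypothetical, and the frame is split with `iloc` slices instead of `np.array_split`, which can emit a deprecation warning on DataFrames in recent pandas versions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def some_expensive_computation(x):
    # Stand-in for slow per-row work
    time.sleep(0.01)
    return x


def apply_transform(chunk):
    # Copy first so the caller's DataFrame is left untouched
    chunk = chunk.copy()
    chunk["title"] = chunk["title"].apply(lambda x: x.lower())
    chunk["description"] = chunk["description"].apply(some_expensive_computation)
    return chunk


def split_frame(df, n_chunks):
    """Split df into at most n_chunks consecutive row slices."""
    size = max(1, -(-len(df) // n_chunks))  # ceiling division, at least 1
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]


def parallelize_with_threads(df, func, n_workers=4):
    """Hypothetical helper: apply func to row chunks of df in a thread pool."""
    if df.empty:
        return func(df)
    chunks = split_frame(df, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves chunk order, so row order is kept after concat
        results = list(pool.map(func, chunks))
    return pd.concat(results)
```

Calling `parallelize_with_threads(df, apply_transform, n_workers=8)` with different `n_workers` values is a quick way to see how the timing changes with the number of workers. For CPU-bound functions, a process pool as in the cells above is the usual choice.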