Tuesday, August 13, 2019

Choosing the Correct K Value for the K-Means Clustering Algorithm

Choose K value

Soumil Nitin Shah

Bachelor's in Electronic Engineering | Master's in Electrical Engineering | Master's in Computer Engineering

Hello! I’m Soumil Nitin Shah, a software and hardware developer based in New York City. I have completed my Bachelor's in Electronic Engineering and a double Master's in Computer and Electrical Engineering. I develop Python-based cross-platform desktop applications, webpages, software, REST APIs, databases, and much more. I have more than 2 years of experience in Python.

In [46]:
from IPython.display import YouTubeVideo
YouTubeVideo('Q_u3Rak8yyQ')
Out[46]:

Step 1:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

%matplotlib inline
In [2]:
df = pd.read_csv("AEP_hourly.csv")
In [3]:
df.head(2)
Out[3]:
Datetime AEP_MW
0 2004-12-31 01:00:00 13478.0
1 2004-12-31 02:00:00 12865.0

Step 2:

Extract Features

In [4]:
# Extract date parts (year, month, day, time, etc.) from the timestamp column
dataset = df
timestamps = pd.to_datetime(df["Datetime"])  # parse once and reuse
dataset["Month"] = timestamps.dt.month
dataset["Year"] = timestamps.dt.year
dataset["Date"] = timestamps.dt.date
dataset["Time"] = timestamps.dt.time
# .dt.week was removed in pandas 2.0; isocalendar().week returns the same ISO week number
dataset["Week"] = timestamps.dt.isocalendar().week
dataset["Day"] = timestamps.dt.day_name()
dataset = df.set_index("Datetime")
dataset.index = pd.to_datetime(dataset.index)
dataset.head(1)
Out[4]:
AEP_MW Month Year Date Time Week Day
Datetime
2004-12-31 01:00:00 13478.0 12 2004 2004-12-31 01:00:00 53 Friday
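
As a quick sanity check on the extracted columns, the same date parts can be reproduced for the first timestamp in the file using only Python's standard `datetime` module. This is an illustrative sketch; the helper name `date_parts` is hypothetical and not part of the notebook.

```python
from datetime import datetime

def date_parts(stamp):
    """Split a 'YYYY-MM-DD HH:MM:SS' string into the parts used as features."""
    ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return {
        "Month": ts.month,
        "Year": ts.year,
        "Date": ts.date(),
        "Time": ts.time(),
        "Week": ts.isocalendar()[1],  # ISO week number
        "Day": ts.strftime("%A"),     # weekday name (English in the default locale)
    }

parts = date_parts("2004-12-31 01:00:00")
print(parts["Month"], parts["Year"], parts["Week"], parts["Day"])
```

The ISO week number for 31 December 2004 is 53, which matches the `Week` column in the output above.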

Step 3:

Resample Data

In [10]:
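# Average only the numeric columns; recent pandas versions raise an error on the text columns otherwise
NewDataSet = dataset.resample('D').mean(numeric_only=True)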
NewDataSet.head(2)
Out[10]:
AEP_MW Month Year Week
Datetime
2004-10-01 14284.521739 10 2004 40
2004-10-02 12999.875000 10 2004 40
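
Resampling with 'D' groups the hourly readings by calendar day and averages each group. The sketch below shows the same idea in plain Python on a few hypothetical readings. The values are made up for illustration and are not taken from AEP_hourly.csv, and `daily_mean` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime

def daily_mean(readings):
    """Average (timestamp string, value) pairs per calendar day."""
    buckets = defaultdict(list)
    for stamp, value in readings:
        day = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date()
        buckets[day].append(value)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}

# Hypothetical hourly readings in MW
sample = [
    ("2004-10-01 01:00:00", 12000.0),
    ("2004-10-01 02:00:00", 14000.0),
    ("2004-10-02 01:00:00", 13000.0),
]
result = daily_mean(sample)
for day, mean_mw in result.items():
    print(day, mean_mw)
```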

Create Dataset

In [14]:
x = dataset["Time"]
y = dataset["AEP_MW"]

y = y.values.reshape(-1,1)

print(y.shape)
(121273, 1)

Step 4:

Choosing the Correct K Values

In [25]:
y = dataset["AEP_MW"]
x = dataset["Time"]

y = y.values.reshape(-1,1)

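error = []
k = []
for i in range(1, 15):
    # Unsupervised learning: fit K-Means with i clusters
    kmean = KMeans(n_clusters=i)
    kmean.fit(y)
    # inertia_ is the sum of squared distances of samples to their closest cluster center
    error.append(kmean.inertia_)
    # Record the actual number of clusters so the x-axis matches n_clusters
    k.append(i)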
    
plt.plot(k, error, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.show()
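
The quantity plotted on the y-axis, `inertia_`, is the sum of squared distances from each sample to its nearest cluster center. A minimal sketch of that calculation for one-dimensional data follows. The function name `inertia_1d` and the sample numbers are assumptions for illustration only.

```python
def inertia_1d(values, centers):
    """Sum of squared distances from each value to its closest center."""
    return sum(min((v - c) ** 2 for c in centers) for v in values)

# Hypothetical data: two tight groups around 1.5 and 9.5
values = [1.0, 2.0, 9.0, 10.0]
one_center = inertia_1d(values, [5.5])
two_centers = inertia_1d(values, [1.5, 9.5])
print(one_center, two_centers)
```

Going from one center to two cuts the error sharply in this toy example. Once k matches the natural grouping of the data, further centers bring much smaller gains, and that flattening is the "elbow" in the plot.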
    
    

Conclusion:

The elbow plot suggests that the correct K value is 2, 3, or 4. You should not go above 4, as it would then be very difficult to detect outliers.
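
Reading the elbow by eye can be subjective. One common heuristic, which is not the method used in this notebook, is to pick the k where the curve bends most sharply, measured by the largest second difference of the error values. The sketch below assumes two lists of matching cluster counts and inertia values; the function name and the numbers are hypothetical.

```python
def elbow_by_second_difference(ks, errors):
    """Return the k at which the error curve has the largest second difference."""
    best_k, best_bend = None, float("-inf")
    # The first and last points have no neighbor on one side, so skip them
    for idx in range(1, len(errors) - 1):
        bend = errors[idx - 1] - 2 * errors[idx] + errors[idx + 1]
        if bend > best_bend:
            best_k, best_bend = ks[idx], bend
    return best_k

# Hypothetical inertia values for k = 1..5
ks = [1, 2, 3, 4, 5]
errors = [100.0, 40.0, 20.0, 15.0, 12.0]
elbow = elbow_by_second_difference(ks, errors)
print(elbow)
```

A result like this is best treated as a guide to be checked against the plot, since noisy curves can bend in more than one place.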

Create Model with Correct K Values

In [34]:
kmean = KMeans(n_clusters=2,init='k-means++',max_iter=400,random_state=101)
In [37]:
y = dataset["AEP_MW"].values.reshape(-1,1)
x = dataset["Time"].values

kmean.fit(y)
Out[37]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=400,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=101, tol=0.0001, verbose=0)
In [38]:
kmean.labels_
Out[38]:
array([1, 1, 1, ..., 0, 0, 0], dtype=int32)
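
`labels_` holds, for every row, the index of the cluster whose center is closest. To make that concrete, here is a simplified one-dimensional version of the K-Means loop in plain Python. It is a sketch only: scikit-learn's `KMeans` uses k-means++ initialization and several restarts, whereas this version starts from centers supplied by the caller, and the sample values are hypothetical.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Alternate between assigning points and updating centers until stable."""
    centers = list(centers)
    labels = []
    for _ in range(max_iter):
        # Assignment step: label each value with the index of its nearest center
        labels = [
            min(range(len(centers)), key=lambda j: (v - centers[j]) ** 2)
            for v in values
        ]
        # Update step: move each center to the mean of its members
        new_centers = []
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return labels, centers

# Hypothetical load readings in MW, with assumed starting centers
readings = [12000.0, 12500.0, 13000.0, 17000.0, 17500.0, 18000.0]
labels, centers = kmeans_1d(readings, [12000.0, 18000.0])
print(labels, centers)
```

Which group ends up as label 0 or 1 depends on the starting centers, so the label numbers themselves carry no meaning beyond distinguishing the clusters.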
In [39]:
dataset["Cluster"] = kmean.labels_
In [40]:
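# Take the first 2000 points, keeping values and labels aligned row by row
subset = dataset.iloc[0:2000]
x = np.arange(len(subset))          # position of each hourly reading
y = subset["AEP_MW"].values         # load in MW for the same rows
cluster = subset["Cluster"].values  # cluster label for the same rows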

# CHECK THE SHAPE 
print(x.shape)
print(y.shape)
print(cluster.shape)
(2000,)
(2000,)
(2000,)
In [41]:
import seaborn as sns

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=x, y=y, s=120, hue=cluster)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a20e903c8>
In [42]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x,
           y,
           zs=0,
           marker="s", 
           s=40, 
           c = cluster.astype(float),
           depthshade=True,
           label='Cluster 1 and 2')
plt.title("Cluster 1 and Cluster 2")
Out[42]:
Text(0.5, 0.92, 'Cluster 1 and Cluster 2')
