Choosing the Correct K Value for the K-Means Clustering Algorithm¶

Soumil Nitin Shah¶

Bachelor in Electronic Engineering | Masters in Electrical Engineering | Master in Computer Engineering |

Website : https://soumilshah.herokuapp.com
Github: https://github.com/soumilshah1995
Linkedin: https://www.linkedin.com/in/shah-soumil/
Blog: https://soumilshah1995.blogspot.com/
Youtube : https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw?view_as=subscriber
Facebook Page : https://www.facebook.com/soumilshah1995/
Email : shahsoumil519@gmail.com
projects : https://soumilshah.herokuapp.com/project

Hello! I’m Soumil Nitin Shah, a software and hardware developer based in New York City. I have completed my Bachelor’s in Electronic Engineering and a double Master’s in Computer and Electrical Engineering. I develop Python-based cross-platform desktop applications, webpages, software, REST APIs, databases, and much more. I have more than 2 years of experience in Python.

from IPython.display import YouTubeVideo
YouTubeVideo('Q_u3Rak8yyQ')

Step 1:¶

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

%matplotlib inline

df = pd.read_csv("AEP_hourly.csv")

df.head(2)

Step 2:¶

Extract Features¶

# Extract date parts (year, month, day, time, etc.) from the Datetime column
dataset = df.copy()  # work on a copy so df keeps only its original columns
parsed = pd.to_datetime(dataset["Datetime"])  # parse once and reuse
dataset["Month"] = parsed.dt.month
dataset["Year"] = parsed.dt.year
dataset["Date"] = parsed.dt.date
dataset["Time"] = parsed.dt.time
dataset["Week"] = parsed.dt.isocalendar().week  # dt.week was removed in pandas 2.0
dataset["Day"] = parsed.dt.day_name()
dataset = dataset.set_index("Datetime")
dataset.index = pd.to_datetime(dataset.index)
dataset.head(1)
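
To see what the extracted features look like without loading the full CSV, here is a minimal sketch on a tiny hand-made frame. The first two rows match the sample output of the AEP data; the third row, the `sample` and `features` names, and the extra `Hour` column are assumptions added for illustration only.

```python
import pandas as pd

# Hypothetical sample that mimics the Datetime / AEP_MW columns of AEP_hourly.csv
sample = pd.DataFrame({
    "Datetime": ["2004-12-31 01:00:00", "2004-12-31 02:00:00", "2005-01-01 00:00:00"],
    "AEP_MW": [13478.0, 12865.0, 12500.0],  # the last value is made up
})

# Parse the strings once, then derive every feature from the parsed column
parsed = pd.to_datetime(sample["Datetime"])
features = sample.assign(
    Month=parsed.dt.month,
    Year=parsed.dt.year,
    Hour=parsed.dt.hour,
    Week=parsed.dt.isocalendar().week,  # ISO week number
    Day=parsed.dt.day_name(),
)
print(features[["Month", "Year", "Hour", "Week", "Day"]])
```

Note that the ISO week number can belong to the previous ISO year around New Year, so a date in early January may still report week 52 or 53.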

Step 3:¶

Resample Data¶

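# Average the numeric columns per day; the Date, Time and Day columns are not numeric
NewDataSet = dataset.resample('D').mean(numeric_only=True)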
NewDataSet.head(2)
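
As a rough illustration of what daily resampling does, the sketch below builds 48 hours of made-up load values and averages them per day. The `hourly` and `daily` names and the values themselves are hypothetical and are not taken from the AEP data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly index: 48 consecutive hours starting at midnight
idx = pd.date_range("2004-12-31 00:00", periods=48, freq=pd.Timedelta(hours=1))

# Made-up load values 0, 1, ..., 47 so the daily means are easy to check by hand
hourly = pd.DataFrame({"AEP_MW": np.arange(48, dtype=float)}, index=idx)

# 'D' groups the rows by calendar day; mean() collapses each group to one row
daily = hourly.resample("D").mean()
print(daily)
```

Each output row is the average of the 24 hourly readings of that day, so the 48 input rows become 2 rows.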

Create Dataset¶

x = dataset["Time"]
y = dataset["AEP_MW"]

y = y.values.reshape(-1,1)

print(y.shape)

(121273, 1)

Step 4:¶

Choosing the Correct K Values¶

y = dataset["AEP_MW"]
x = dataset["Time"]

y = y.values.reshape(-1,1)

error = []
k = []
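for i in range(1, 15):
    # Unsupervised learning: fit K-Means with i clusters
    kmean = KMeans(n_clusters=i)
    kmean.fit(y)
    error.append(kmean.inertia_)  # within-cluster sum of squared distances
    k.append(i)  # record the actual number of clusters (1..14), not a 0-based counter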
    
plt.plot(k, error, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.show()
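
Reading the elbow off a plot is subjective, so it can help to compute a rough candidate as well. The sketch below is one possible heuristic, not part of the original notebook: it picks the k where the inertia curve bends the most, measured by the largest second difference. The synthetic two-group data, the `values`, `ks`, `inertias` and `elbow_k` names, and the heuristic itself are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical 1-D data: two well-separated groups standing in for AEP_MW
values = np.concatenate([
    rng.normal(12_000, 300, 300),
    rng.normal(18_000, 300, 300),
]).reshape(-1, 1)

ks = list(range(1, 9))
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
    inertias.append(model.inertia_)  # within-cluster sum of squared distances

# Heuristic (an assumption, not a rule): the elbow is where the curve bends most,
# i.e. where the second difference of the inertia is largest.
second_diff = np.diff(inertias, n=2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]  # +1 because diff drops the first k
print(elbow_k)
```

On real data the curve is usually smoother than in this toy case, so the heuristic is best treated as a starting point to compare against the plot.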

Conclusion:¶

The elbow plot suggests that a suitable k is 2, 3, or 4. You should not go above 4, as it would become very difficult to detect outliers.

Create Model with Correct K Values¶

kmean = KMeans(n_clusters=2,init='k-means++',max_iter=400,random_state=101)

y = dataset["AEP_MW"].values.reshape(-1,1)
x = dataset["Time"].values

kmean.fit(y)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=400,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=101, tol=0.0001, verbose=0)

kmean.labels_

array([1, 1, 1, ..., 0, 0, 0], dtype=int32)

dataset["Cluster"] = kmean.labels_
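
Once labels are attached, a quick summary per cluster shows what each label actually means. The following is a minimal sketch on synthetic data, assuming two load levels around 12,000 and 18,000 MW; the `demo`, `model`, `summary` and `centers` names are hypothetical, and only the KMeans parameters mirror the model above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical stand-in for the AEP_MW column: a low-load and a high-load group
demo = pd.DataFrame({"AEP_MW": np.concatenate([
    rng.normal(12_000, 300, 100),
    rng.normal(18_000, 300, 150),
])})

model = KMeans(n_clusters=2, init="k-means++", n_init=10,
               max_iter=400, random_state=101)
demo["Cluster"] = model.fit_predict(demo[["AEP_MW"]])

# Size and value range of each cluster, plus the fitted centers
summary = demo.groupby("Cluster")["AEP_MW"].agg(["count", "min", "mean", "max"])
centers = model.cluster_centers_.ravel()
print(summary)
print(centers)
```

The label numbers 0 and 1 are arbitrary, so it is the center values, not the label order, that tell you which cluster is the high-load one.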

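# Take the first 2000 points of each cluster
# Note: x and y come from different clusters and are paired only by position,
# while cluster holds the labels of the first 2000 rows of the whole dataset
x = dataset[dataset["Cluster"] == 0]["AEP_MW"][0:2000].values
y = dataset[dataset["Cluster"] == 1]["AEP_MW"][0:2000].values
cluster = dataset["Cluster"][0:2000].values

# Check the shapes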
print(x.shape)
print(y.shape)
print(cluster.shape)

(2000,)
(2000,)
(2000,)

import seaborn as sns

# Newer seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=x, y=y, s=120, hue=cluster)

<matplotlib.axes._subplots.AxesSubplot at 0x1a20e903c8>

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x,
           y,
           zs=0,
           marker="s", 
           s=40, 
           c = cluster.astype(float),
           depthshade=True,
           label='Cluster 1 and 2')
plt.title("Cluster 1 and Cluster 2")

Text(0.5, 0.92, 'Cluster 1 and Cluster 2')

Output of df.head(2):

     Datetime             AEP_MW
0    2004-12-31 01:00:00  13478.0
1    2004-12-31 02:00:00  12865.0

Pythonist

Tuesday, August 13, 2019
