Tuesday, August 13, 2019

Choosing the Correct K Value for the K-Means Clustering Algorithm

Choose K value

Soumil Nitin Shah

Bachelor's in Electronic Engineering | Master's in Electrical Engineering | Master's in Computer Engineering

Hello! I'm Soumil Nitin Shah, a software and hardware developer based in New York City. I have completed my Bachelor's in Electronic Engineering and a double Master's in Computer and Electrical Engineering. I develop Python-based cross-platform desktop applications, web pages, software, REST APIs, databases, and much more. I have more than 2 years of experience in Python.

In [46]:
from IPython.display import YouTubeVideo
YouTubeVideo('Q_u3Rak8yyQ')
Out[46]:

Step 1:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

%matplotlib inline
In [2]:
df = pd.read_csv("AEP_hourly.csv")
In [3]:
df.head(2)
Out[3]:
Datetime AEP_MW
0 2004-12-31 01:00:00 13478.0
1 2004-12-31 02:00:00 12865.0

Step 2:

Extract Features

In [4]:
# Extract date parts (month, year, date, time, ISO week, weekday name)
dataset = df.copy()
timestamps = pd.to_datetime(dataset["Datetime"])  # parse once, reuse below
dataset["Month"] = timestamps.dt.month
dataset["Year"] = timestamps.dt.year
dataset["Date"] = timestamps.dt.date
dataset["Time"] = timestamps.dt.time
# .dt.week was removed in pandas 2.0; isocalendar().week is its replacement
dataset["Week"] = timestamps.dt.isocalendar().week.astype(int)
dataset["Day"] = timestamps.dt.day_name()
dataset = dataset.set_index("Datetime")
dataset.index = pd.to_datetime(dataset.index)
dataset.head(1)
Out[4]:
AEP_MW Month Year Date Time Week Day
Datetime
2004-12-31 01:00:00 13478.0 12 2004 2004-12-31 01:00:00 53 Friday

Step 3:

Resample Data

In [10]:
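# Average only the numeric columns; recent pandas raises on non-numeric ones (Date, Time, Day)
NewDataSet = dataset.select_dtypes("number").resample('D').mean()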
NewDataSet.head(2)
Out[10]:
AEP_MW Month Year Week
Datetime
2004-10-01 14284.521739 10 2004 40
2004-10-02 12999.875000 10 2004 40
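To see what the daily resample does, here is a minimal sketch on synthetic data. It assumes 48 made-up hourly readings (simply 0 to 47) over two days plus one text column; none of these values come from AEP_hourly.csv. Each day collapses to the mean of its 24 hourly values, and the text column is left out before averaging.

```python
import numpy as np
import pandas as pd

# 48 consecutive hourly timestamps starting at midnight (synthetic index).
hours = pd.to_timedelta(np.arange(48), unit="h")
idx = pd.Timestamp("2004-10-01") + hours

hourly = pd.DataFrame(
    {
        "AEP_MW": np.arange(48, dtype=float),  # made-up load values
        "Day": list(idx.day_name()),           # non-numeric column
    },
    index=idx,
)

# Keep numeric columns only, then average each calendar day.
daily = hourly.select_dtypes("number").resample("D").mean()
print(daily)
```

The first day averages the values 0 to 23 and the second day 24 to 47, so the result has one row per day.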

Create Dataset

In [14]:
x = dataset["Time"]
y = dataset["AEP_MW"]

y = y.values.reshape(-1,1)

print(y.shape)
(121273, 1)
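The reshape is needed because scikit-learn estimators such as KMeans expect a 2-D array of shape (n_samples, n_features), while a single column comes out of pandas as a 1-D array. A small sketch with three illustrative values (the third one is hypothetical):

```python
import numpy as np

# Three load readings in MW; the last value is made up for illustration.
load = np.array([13478.0, 12865.0, 12500.0])

# -1 lets NumPy infer the number of rows; 1 means one feature per sample.
X = load.reshape(-1, 1)

print(load.shape, X.shape)  # (3,) (3, 1)
```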

Step 4:

Choosing the Correct K Values

In [25]:
y = dataset["AEP_MW"]
x = dataset["Time"]

y = y.values.reshape(-1,1)

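error = []
k = []
for i in range(1, 15):
    # Unsupervised learning: fit K-Means with i clusters
    kmean = KMeans(n_clusters=i)
    kmean.fit(y)
    error.append(kmean.inertia_)  # within-cluster sum of squared distances
    k.append(i)  # record the actual cluster count so the x-axis matches k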
    
plt.plot(k, error, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.grid(True)
plt.show()
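The value on the y-axis, inertia_, is the sum of squared distances from every point to the center of the cluster it was assigned to. The sketch below recomputes it by hand to make that concrete. It assumes synthetic 1-D data drawn around two made-up load levels, not the AEP data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic 1-D "load" values around two made-up levels.
X = np.concatenate([
    rng.normal(12000, 500, 200),
    rng.normal(18000, 500, 200),
]).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up each point's assigned center, then sum the squared distances.
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((X - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two printed numbers should agree up to floating-point rounding. Adding clusters always lets points sit closer to a center, which is why the curve keeps falling and the elbow, not the minimum, is what matters.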
    
    

Conclusion:

The elbow plot suggests that a suitable K is 2, 3, or 4. Going above 4 is not recommended, as it would make outliers very difficult to detect.
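Reading an elbow by eye is subjective. One possible aid, which is not part of the original notebook, is to print how much the inertia falls with each additional cluster and look for the point where the gain becomes small. The sketch assumes synthetic data with three made-up load levels, so the numbers it prints say nothing about the AEP data.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic 1-D data with three well-separated, made-up levels.
X = np.concatenate(
    [rng.normal(center, 300, 150) for center in (10000, 15000, 20000)]
).reshape(-1, 1)

ks = list(range(1, 7))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# Fractional reduction in inertia when moving from k - 1 to k clusters.
drops = [
    (inertias[i] - inertias[i + 1]) / inertias[i]
    for i in range(len(ks) - 1)
]

for k, drop in zip(ks[1:], drops):
    print(f"k={k}: inertia falls by {drop:.1%} relative to k={k - 1}")
```

On this synthetic data the reduction is large up to k=3 and much smaller afterwards, which is the numeric counterpart of the bend in the plot.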

Create Model with Correct K Values

In [34]:
kmean = KMeans(n_clusters=2,init='k-means++',max_iter=400,random_state=101)
In [37]:
y = dataset["AEP_MW"].values.reshape(-1,1)
x = dataset["Time"].values

kmean.fit(y)
Out[37]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=400,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=101, tol=0.0001, verbose=0)
In [38]:
kmean.labels_
Out[38]:
array([1, 1, 1, ..., 0, 0, 0], dtype=int32)
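The labels alone do not say which cluster holds the high loads and which the low ones, because K-Means numbers its clusters arbitrarily. A quick way to interpret them is to look at cluster_centers_ together with the number of points per label. The sketch below assumes seven hypothetical load readings and reuses the model settings from the cell above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical load readings in MW: three low values and four high values.
load = np.array(
    [12000.0, 12500.0, 13000.0, 18000.0, 18500.0, 19000.0, 19500.0]
).reshape(-1, 1)

km = KMeans(n_clusters=2, init="k-means++", n_init=10,
            max_iter=400, random_state=101).fit(load)

centers = km.cluster_centers_.ravel()          # one center per cluster
counts = np.bincount(km.labels_, minlength=2)  # points per cluster label

# List clusters from lowest to highest center so the output is easy to read.
for label in np.argsort(centers):
    print(f"cluster {label}: center={centers[label]:.1f} MW, points={counts[label]}")
```

Sorting by center, as done here, gives a stable ordering even if the label numbers change between runs or random seeds.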
In [39]:
dataset["Cluster"] = kmean.labels_
In [40]:
# Take the first 2000 points of each cluster
x = dataset[dataset["Cluster"] == 0]["AEP_MW"][0:2000].values
y = dataset[dataset["Cluster"] == 1]["AEP_MW"][0:2000].values
# Labels of the first 2000 rows overall (not row-aligned with x and y above)
cluster = dataset["Cluster"][0:2000].values

# Check the shapes
print(x.shape)
print(y.shape)
print(cluster.shape)
(2000,)
(2000,)
(2000,)
In [41]:
import seaborn as sns

# Pass x and y as keywords; seaborn 0.12+ no longer accepts them positionally
sns.scatterplot(x=x, y=y, s=120, hue=cluster)
Out[41]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a20e903c8>
In [42]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(x,
           y,
           zs=0,
           marker="s", 
           s=40, 
           c = cluster.astype(float),
           depthshade=True,
           label='Cluster 1 and 2')
plt.title("Cluster 1 and Cluster 2")
Out[42]:
Text(0.5, 0.92, 'Cluster 1 and Cluster 2')
