Tuesday, August 13, 2019

Choosing Correct K Value for kneighbors Clustering Algorithm

Choosing Correct K Value for kneighbors Clustering Algorithm

Choosing Correct K Value for kneighbors Clustering Algorithm

Soumil Nitin Shah

Bachelor in Electronic Engineering | Masters in Electrical Engineering | Master in Computer Engineering |

Hello! I’m Soumil Nitin Shah, a Software and Hardware Developer based in New York City. I have completed by Bachelor in Electronic Engineering and my Double master’s in Computer and Electrical Engineering. I Develop Python Based Cross Platform Desktop Application , Webpages , Software, REST API, Database and much more I have more than 2 Years of Experience in Pythonm

Step 1:

Import all Modules

In [17]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

import scikitplot as skplt
%matplotlib inline
In [7]:
df = pd.read_csv('KNN_Project_Data')
df.head(2)
Out[7]:
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM JHZC TARGET CLASS
0 1636.670614 817.988525 2565.995189 358.347163 550.417491 1618.870897 2147.641254 330.727893 1494.878631 845.136088 0
1 1013.402760 577.587332 2644.141273 280.428203 1161.873391 2084.107872 853.404981 447.157619 1193.032521 861.081809 1

Step 2:

Do a Train Test Split

In [14]:
df = pd.read_csv('KNN_Project_Data')
X_Data = df[['XVPM', 'GWYH', 'TRAT', 'TLLZ', 'IGGA', 'HYKR', 'EDFS', 'GUUB', 'MGJM','JHZC']]
Y_Data = df ['TARGET CLASS']
In [15]:
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X_Data, Y_Data, test_size=0.4, random_state=101)

Choose K values

In [19]:
errorrate = []

for i in range(1,50):
    newmodel = KNeighborsClassifier(n_neighbors = i)
    newmodel.fit(X_Train, Y_Train)
    pred = newmodel.predict(X_Test)
    errorrate.append(np.mean(pred != Y_Test))
            
plt.figure(figsize=(10,6))
plt.plot(range(1, 50), errorrate, color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Out[19]:
Text(0, 0.5, 'Error Rate')

Create a model

In [20]:
model = KNeighborsClassifier(n_neighbors=58)
In [21]:
model.fit(X_Train, Y_Train)
Out[21]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=58, p=2,
           weights='uniform')
In [22]:
pred = model.predict(X_Test)
In [23]:
pred
Out[23]:
array([1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 1, 1, 1])
In [24]:
confusion_matrix(Y_Test,pred)
Out[24]:
array([[158,  41],
       [ 34, 167]])
In [26]:
print(classification_report(Y_Test, pred))
              precision    recall  f1-score   support

           0       0.82      0.79      0.81       199
           1       0.80      0.83      0.82       201

   micro avg       0.81      0.81      0.81       400
   macro avg       0.81      0.81      0.81       400
weighted avg       0.81      0.81      0.81       400

Developer Guide: Getting Started with Flink (PyFlink) and Hudi - Setting Up Your Local Environment and Performing CRUD Operations via flink

flink-hudi-final Install Flink and Python ¶ conda info --envs # Create ENV conda ...