Sunday, April 11, 2021

Machine Learning on Skills Dataset from LinkedIn | Identify Category of Skills | Python


Authors

  • Soumil Nitin Shah

Soumil Nitin Shah

Bachelor's in Electronic Engineering | Master's in Electrical Engineering | Master's in Computer Engineering

Extensive experience building scalable, high-performance software applications, combining skills in Internet of Things (IoT), machine learning, and full-stack web development in Python.

Define Imports
In [26]:
try:

    import json
    import os
    import string
    
    import pandas as pd
    import numpy as np
    
    import re
    import swifter

    # Import various Models
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import SGDClassifier
    from sklearn.svm import LinearSVC


    # Import preprocessing libraries
    from sklearn.preprocessing import LabelEncoder

    # Import model selection and evaluation utilities
    from sklearn.model_selection import cross_val_score, train_test_split

    from sklearn.metrics import classification_report
    
    import seaborn as sns
    import matplotlib as mpl
    
    import torch
    import transformers as ppb # pytorch transformers
    
    import matplotlib.pyplot as plt
except Exception as e:
    print("Error : {} ".format(e))
In [27]:
os.listdir()
Out[27]:
['.ipynb_checkpoints',
 'public_use-industry-skills-needs.xlsx',
 'Untitled.ipynb',
 '~$public_use-industry-skills-needs.xlsx']
In [28]:
df = pd.read_excel("public_use-industry-skills-needs.xlsx",sheet_name='Industry Skills Needs')
In [29]:
df.shape
Out[29]:
(3500, 7)
In [30]:
df.head(2)
Out[30]:
   year  isic_section_index     isic_section_name    industry_name         skill_group_category  skill_group_name  skill_group_rank
0  2015                   B  Mining and quarrying  Mining & Metals  Specialized Industry Skills            Mining                 1
1  2015                   B  Mining and quarrying  Mining & Metals                  Soft Skills       Negotiation                 2
In [31]:
df['skill_group_category'].value_counts().plot( kind='bar', figsize=(15,10))
plt.title("Skill category Count ")
Out[31]:
Text(0.5, 1.0, 'Skill category Count ')
In [32]:
df["skill_group_category"].nunique()
Out[32]:
5
In [25]:
list(df["skill_group_category"].unique())
Out[25]:
['Specialized Industry Skills',
 'Soft Skills',
 'Business Skills',
 'Tech Skills',
 'Disruptive Tech Skills']
In [34]:
X = df["skill_group_name"]
Y = df["skill_group_category"]

Preprocessing

In [36]:
encoder = LabelEncoder()
Y = encoder.fit_transform(Y)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
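The `train_test_split` call above sets neither `random_state` nor `stratify`, so the split changes from run to run and the class proportions in the test set can drift. A minimal sketch of a reproducible, stratified split follows. The `names` and `labels` lists are made-up toy data, not values from the dataset, and the seed `42` is an arbitrary choice.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy, imbalanced stand-in for the skill names and encoded categories
names = [f"skill_{i}" for i in range(100)]
labels = [0] * 80 + [1] * 20

# random_state makes the split repeatable; stratify keeps the class ratio
x_train, x_test, y_train, y_test = train_test_split(
    names, labels, test_size=0.3, random_state=42, stratify=labels
)

# Count how many examples of each class ended up in the test split
test_counts = Counter(y_test)
```

With stratification the 80/20 class ratio of the toy labels is carried over to both partitions.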
In [40]:
v = dict(zip(list(Y), df['skill_group_category'].to_list()))
In [41]:
v
Out[41]:
{3: 'Specialized Industry Skills',
 2: 'Soft Skills',
 0: 'Business Skills',
 4: 'Tech Skills',
 1: 'Disruptive Tech Skills'}
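The same lookup can also be read directly from the fitted encoder, since `LabelEncoder` sorts the class names and uses each position as the integer code. A small sketch, assuming a toy `categories` list that stands in for `df["skill_group_category"]`:

```python
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df["skill_group_category"], using the five category names shown above
categories = [
    "Specialized Industry Skills",
    "Soft Skills",
    "Business Skills",
    "Tech Skills",
    "Disruptive Tech Skills",
    "Soft Skills",
]

encoder = LabelEncoder()
encoded = encoder.fit_transform(categories)

# classes_ is sorted; the index of each class is its integer code
id_to_label = {i: str(label) for i, label in enumerate(encoder.classes_)}

# inverse_transform maps integer codes back to the original strings
decoded = encoder.inverse_transform([1])
```

Here `id_to_label` plays the role of `v`, and `encoder.inverse_transform` can decode a whole array of predictions in one call.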

DistilBERT for tokenization and feature extraction

In [37]:
class BertTokenizer(object):

    def __init__(self, text=None):
        # Avoid a mutable default argument; fall back to an empty list
        self.text = text if text is not None else []

        # For DistilBERT:
        self.model_class, self.tokenizer_class, self.pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

        # Load pretrained model/tokenizer
        self.tokenizer = self.tokenizer_class.from_pretrained(self.pretrained_weights)

        self.model = self.model_class.from_pretrained(self.pretrained_weights)

    def get(self):

        df = pd.DataFrame(data={"text":self.text})
        tokenized = df["text"].swifter.apply((lambda x: self.tokenizer.encode(x, add_special_tokens=True)))

        # Length of the longest tokenized sequence (0 if there is no input)
        max_len = max((len(i) for i in tokenized.values), default=0)

        padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

        attention_mask = np.where(padded != 0, 1, 0)
        input_ids = torch.tensor(padded)
        attention_mask = torch.tensor(attention_mask)

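        # Run the model without tracking gradients (inference only)
        with torch.no_grad():
            last_hidden_states = self.model(input_ids, attention_mask=attention_mask)

        # Use the hidden state of the first ([CLS]) token as the feature vector
        features = last_hidden_states[0][:, 0, :].numpy()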

        return features
In [38]:
_instance = BertTokenizer(text=x_train)
tokens = _instance.get()
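The padding and attention-mask step inside `get()` can be seen in isolation on a few short sequences. This is only an illustrative sketch: the token ids below are made up rather than produced by the real tokenizer, and it follows the class above in using 0 as the padding id.

```python
import numpy as np

# Hypothetical token-id lists of different lengths (ids invented for illustration)
tokenized = [
    [101, 7, 102],
    [101, 7, 8, 9, 102],
    [101, 102],
]

max_len = max(len(ids) for ids in tokenized)

# Right-pad every sequence with 0 so they stack into one rectangular array
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

# 1 marks a real token, 0 marks padding that the model should ignore
attention_mask = np.where(padded != 0, 1, 0)
```

Each row of `attention_mask` sums to the original sequence length, which is a quick sanity check before passing the arrays to the model.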

Model 1:

LinearSVC

In [42]:
clf_sv = LinearSVC(C=1, class_weight='balanced', multi_class='ovr', random_state=40, max_iter=400)
In [43]:
clf_sv.fit(tokens, y_train)
c:\python38\lib\site-packages\sklearn\svm\_base.py:976: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn("Liblinear failed to converge, increase "
Out[43]:
LinearSVC(C=1, class_weight='balanced', max_iter=400, random_state=40)
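The ConvergenceWarning above means liblinear stopped at `max_iter=400` before converging. Two common remedies are standardizing the features and allowing more iterations. The sketch below shows both on synthetic data; `make_blobs` is only a stand-in for the DistilBERT feature matrix, and `max_iter=5000` is an arbitrary assumption rather than a tuned value.

```python
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the (n_samples, 768) feature matrix and encoded labels
features, labels = make_blobs(n_samples=300, centers=5, n_features=20, random_state=40)

# Scale the features, then fit LinearSVC with a larger iteration budget
clf = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1, class_weight="balanced", random_state=40, max_iter=5000),
)
clf.fit(features, labels)

# Accuracy on the data the pipeline was fitted on
train_accuracy = clf.score(features, labels)
```

Wrapping the scaler and classifier in one pipeline means the same scaling is applied automatically at prediction time.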
Evaluate the model on the test set
In [44]:
_instance = BertTokenizer(text=x_test)
tokensTest = _instance.get()

In [45]:
predicted = clf_sv.predict(tokensTest)
In [46]:
np.mean(predicted == y_test)
Out[46]:
1.0
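That is 100% accuracy on the held-out split, though the same skill names likely occur in both the training and test rows.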
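A perfect score is worth double-checking when the input rows are not unique, because a row-level split can place identical text on both sides. The sketch below measures that overlap and shows one way to split on unique names instead. The helper names `overlap_fraction` and `split_unique` and the toy `rows` list are hypothetical, introduced only for illustration.

```python
import random


def overlap_fraction(train_names, test_names):
    """Fraction of test items whose exact text also occurs in the training set."""
    seen = set(train_names)
    test_names = list(test_names)
    return sum(name in seen for name in test_names) / len(test_names)


def split_unique(names, test_size=0.3, seed=0):
    """Split on unique names so that no name lands in both partitions."""
    unique = sorted(set(names))
    random.Random(seed).shuffle(unique)
    n_test = max(1, round(len(unique) * test_size))
    return unique[n_test:], unique[:n_test]


# Toy rows: the same skill name repeated, as it would be across years and industries
rows = ["Mining", "Negotiation", "Python", "Leadership", "Welding"] * 4

# Row-level split: every test name has already been seen in training
leaky_train, leaky_test = rows[:14], rows[14:]
leaky = overlap_fraction(leaky_train, leaky_test)

# Name-level split: train and test share no names
train, test = split_unique(rows)
clean = overlap_fraction(train, test)
```

On the real data, `overlap_fraction(x_train, x_test)` would show how much of the test set the classifier has effectively already seen.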

Model 2:

KNeighborsClassifier

In [47]:
from sklearn.neighbors import KNeighborsClassifier
In [49]:
errorrate = []
for i in range(1,60, 10):
    print(i)
    newmodel = KNeighborsClassifier(n_neighbors = i)
    newmodel.fit(tokens, y_train)
    pred = newmodel.predict(tokensTest)
    errorrate.append(np.mean(pred != y_test))
1
11
21
31
41
51
In [50]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(10,6))
plt.plot(range(1, 60,10), errorrate, color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()

Based on this plot, a suitable K for KNN would be around 2-8 (the sweep itself only sampled K = 1, 11, 21, 31, 41 and 51).
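To look at the small-K range directly, the sweep can be run over every K from 1 to 10 and scored with cross-validation on the training data, which keeps the test set out of the tuning loop. This is a sketch on synthetic data: `make_blobs` stands in for the DistilBERT features, and the 5-fold setting is an assumption.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the DistilBERT features and encoded labels
features, labels = make_blobs(n_samples=300, centers=5, n_features=20, random_state=0)

k_values = list(range(1, 11))
error_rates = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean 5-fold cross-validated accuracy for this K
    accuracy = cross_val_score(model, features, labels, cv=5).mean()
    error_rates.append(1.0 - accuracy)

# K with the lowest cross-validated error (first one wins on ties)
best_k = k_values[int(np.argmin(error_rates))]
```

Plotting `error_rates` against `k_values` gives the same kind of curve as above, but at a resolution that covers the 2-8 range.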

In [51]:
newmodel = KNeighborsClassifier(n_neighbors = 4)
newmodel.fit(tokens, y_train)
pred = newmodel.predict(tokensTest)
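# Score the KNN predictions in `pred`; comparing `predicted` would re-score the
# LinearSVC model, which is what the recorded 1.0 below reflects
np.mean(pred == y_test)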
Out[51]:
1.0
In [52]:
y_test
Out[52]:
array([0, 0, 2, ..., 2, 4, 2])

Great

Test on unseen Data

In [63]:
SkillName = "Python"
In [64]:
_instance = BertTokenizer(text=[SkillName])
tokens_ = _instance.get()

In [65]:
tokens_.shape
Out[65]:
(1, 768)
In [66]:
pred = newmodel.predict(tokens_)
In [67]:
list(pred)
Out[67]:
[1]
In [68]:
v[list(pred)[0]]
Out[68]:
'Disruptive Tech Skills'

Wooohh !!!!
