Sunday, April 11, 2021

Machine Learning on Skills Dataset from LinkedIn | Identify Category of Skills | Python

Authors

  • Soumil Nitin Shah

Soumil Nitin Shah

Bachelor's in Electronic Engineering | Master's in Electrical Engineering | Master's in Computer Engineering

Extensive experience building scalable, high-performance software applications, combining skills in Internet of Things (IoT), machine learning, and full-stack web development in Python.

Define Imports
In [26]:
# Standard library
import json
import os
import re
import string

# Data handling
import numpy as np
import pandas as pd
import swifter  # registers the .swifter accessor on pandas objects

# Import various Models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC

# Import pre processing libraries
from sklearn.preprocessing import LabelEncoder

# Import Test Modules
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import classification_report

# Plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# DistilBERT (imports are left unguarded so a missing package fails loudly)
import torch
import transformers as ppb  # pytorch transformers
In [27]:
os.listdir()
Out[27]:
['.ipynb_checkpoints',
 'public_use-industry-skills-needs.xlsx',
 'Untitled.ipynb',
 '~$public_use-industry-skills-needs.xlsx']
In [28]:
df = pd.read_excel("public_use-industry-skills-needs.xlsx",sheet_name='Industry Skills Needs')
In [29]:
df.shape
Out[29]:
(3500, 7)
In [30]:
df.head(2)
Out[30]:
year isic_section_index isic_section_name industry_name skill_group_category skill_group_name skill_group_rank
0 2015 B Mining and quarrying Mining & Metals Specialized Industry Skills Mining 1
1 2015 B Mining and quarrying Mining & Metals Soft Skills Negotiation 2
In [31]:
df['skill_group_category'].value_counts().plot( kind='bar', figsize=(15,10))
plt.title("Skill category Count ")
Out[31]:
Text(0.5, 1.0, 'Skill category Count ')
In [32]:
df["skill_group_category"].nunique()
Out[32]:
5
In [25]:
list(df["skill_group_category"].unique())
Out[25]:
['Specialized Industry Skills',
 'Soft Skills',
 'Business Skills',
 'Tech Skills',
 'Disruptive Tech Skills']
In [34]:
X = df["skill_group_name"]
Y = df["skill_group_category"]

Pre Processing

In [36]:
encoder = LabelEncoder()
Y = encoder.fit_transform(Y)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)
In [40]:
v = dict(zip(list(Y), df['skill_group_category'].to_list()))
In [41]:
v
Out[41]:
{3: 'Specialized Industry Skills',
 2: 'Soft Skills',
 0: 'Business Skills',
 4: 'Tech Skills',
 1: 'Disruptive Tech Skills'}
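
The same id-to-label mapping can also be read straight from the fitted encoder, without zipping it against the DataFrame column. The snippet below is a minimal sketch that refits a LabelEncoder on the five category names shown above; the variable names id_to_label and decoded are illustrative and not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Category names as listed in the notebook output above.
categories = [
    "Specialized Industry Skills",
    "Soft Skills",
    "Business Skills",
    "Tech Skills",
    "Disruptive Tech Skills",
]

encoder = LabelEncoder()
encoded = encoder.fit_transform(categories)

# classes_ is sorted, so position i holds the label for integer code i.
id_to_label = {i: str(name) for i, name in enumerate(encoder.classes_)}

# inverse_transform decodes predicted ids without a hand-built dictionary.
decoded = [str(name) for name in encoder.inverse_transform([1, 4])]

print(id_to_label)
print(decoded)
```

Because LabelEncoder sorts its classes, this reproduces the numbering in the dictionary v above.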

DistilBERT for tokenization and feature extraction

In [37]:
class BertTokenizer(object):
    """Tokenize texts with DistilBERT and return one feature vector per text."""

    def __init__(self, text=None):
        # Avoid a mutable default argument; fall back to an empty list
        self.text = text if text is not None else []

        # For DistilBERT:
        self.model_class, self.tokenizer_class, self.pretrained_weights = (
            ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

        # Load pretrained model/tokenizer
        self.tokenizer = self.tokenizer_class.from_pretrained(self.pretrained_weights)
        self.model = self.model_class.from_pretrained(self.pretrained_weights)

    def get(self):
        df = pd.DataFrame(data={"text": self.text})

        # Convert each text to a list of token ids, including the special tokens
        tokenized = df["text"].swifter.apply(
            lambda x: self.tokenizer.encode(x, add_special_tokens=True))

        # Pad every sequence with 0 up to the length of the longest one
        max_len = max((len(i) for i in tokenized.values), default=0)
        padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])

        # The mask is 1 for real tokens and 0 for padding
        attention_mask = np.where(padded != 0, 1, 0)
        input_ids = torch.tensor(padded)
        attention_mask = torch.tensor(attention_mask)

        # Forward pass without gradient tracking
        with torch.no_grad():
            last_hidden_states = self.model(input_ids, attention_mask=attention_mask)

        # Keep the hidden state of the first token of each sequence as its feature vector
        features = last_hidden_states[0][:, 0, :].numpy()

        return features
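
The padding and attention-mask step inside get() can be tried on its own without loading the model. This is a minimal sketch with made-up token-id lists (the ids are for illustration only); it shows how sequences of unequal length become one rectangular array plus a mask.

```python
import numpy as np

# Hypothetical token-id lists of unequal length, standing in for tokenizer output.
tokenized = [[101, 7, 102], [101, 7, 8, 9, 102], [101, 102]]

max_len = max(len(ids) for ids in tokenized)

# Right-pad every sequence with 0 up to the longest length.
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized])

# 1 marks a real token, 0 marks padding.
attention_mask = (padded != 0).astype(int)

print(padded)
print(attention_mask)
```

The mask is what lets the model ignore the padded positions when it computes the hidden states.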
In [38]:
_instance = BertTokenizer(text=x_train)
tokens = _instance.get()

Model 1:

LinearSVC

In [42]:
clf_sv = LinearSVC(C=1, class_weight='balanced', multi_class='ovr', random_state=40, max_iter=400)
In [43]:
clf_sv.fit(tokens, y_train)
c:\python38\lib\site-packages\sklearn\svm\_base.py:976: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  warnings.warn("Liblinear failed to converge, increase "
Out[43]:
LinearSVC(C=1, class_weight='balanced', max_iter=400, random_state=40)
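
The ConvergenceWarning above means liblinear stopped before reaching its tolerance. Two common remedies are scaling the features and raising max_iter. The snippet below is a sketch on synthetic data from make_classification, not on the DistilBERT features; max_iter=5000 and the names X_demo, y_demo and clf_demo are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the embedding matrix and its labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Scale the features first, then give the solver a larger iteration budget.
clf_demo = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1, class_weight="balanced", max_iter=5000, random_state=40),
)
clf_demo.fit(X_demo, y_demo)

train_accuracy = clf_demo.score(X_demo, y_demo)
print(train_accuracy)
```

Whether this removes the warning on the real features would need to be checked by rerunning the notebook.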
Test how good our model is
In [44]:
_instance = BertTokenizer(text=x_test)
tokensTest = _instance.get()

In [45]:
predicted = clf_sv.predict(tokensTest)
In [46]:
np.mean(predicted == y_test)
Out[46]:
1.0
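The accuracy on the held-out split is 1.0 (100 %). Treat this score with caution: skill_group_name values can repeat across years and industries, so a random split may put identical strings in both the training and the test set.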
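
One way to check whether repeated skill names inflate the score is to split by unique skill name, so that no name appears on both sides. The snippet below is a sketch using GroupShuffleSplit on a small made-up frame; only "Mining" and "Negotiation" come from the data shown earlier, and the other names and labels are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame in which each skill name repeats, as it can across years and industries.
toy = pd.DataFrame({
    "skill_group_name": ["Mining", "Negotiation", "Mining", "Teamwork",
                         "Negotiation", "Teamwork", "Accounting", "Accounting"],
    "skill_group_category": ["Specialized Industry Skills", "Soft Skills",
                             "Specialized Industry Skills", "Soft Skills",
                             "Soft Skills", "Soft Skills",
                             "Business Skills", "Business Skills"],
})

# Every row sharing a skill name lands on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["skill_group_name"]))

train_names = set(toy.iloc[train_idx]["skill_group_name"])
test_names = set(toy.iloc[test_idx]["skill_group_name"])

print(train_names, test_names)
```

With the real DataFrame, the same split could be applied to df using df["skill_group_name"] as the groups argument.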

Model 2:

KNeighborsClassifier

In [47]:
from sklearn.neighbors import KNeighborsClassifier
In [49]:
errorrate = []
for i in range(1,60, 10):
    print(i)
    newmodel = KNeighborsClassifier(n_neighbors = i)
    newmodel.fit(tokens, y_train)
    pred = newmodel.predict(tokensTest)
    errorrate.append(np.mean(pred != y_test))
1
11
21
31
41
51
In [50]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(10,6))
plt.plot(range(1, 60,10), errorrate, color='blue', linestyle='dashed', marker='o', markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.show()

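Judging by the plot, a suitable K for KNN would be around 2-8. Note that the loop above only evaluated K = 1, 11, 21, 31, 41 and 51, so this range was not tested directly.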
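
A finer search over small K values can be done with cross_val_score, which is imported at the top of the notebook but not used. The snippet below is a sketch on synthetic data rather than the DistilBERT features; the range 1 to 15, the 5-fold setting and the name best_k are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the embedding matrix and its labels.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

k_values = list(range(1, 16))
mean_errors = []
for k in k_values:
    # Mean 5-fold accuracy for this K, converted to an error rate.
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=5)
    mean_errors.append(1.0 - scores.mean())

# K with the lowest cross-validated error.
best_k = k_values[int(np.argmin(mean_errors))]
print(best_k)
```

Cross-validating on the training features also keeps the test split out of the choice of K.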

In [51]:
newmodel = KNeighborsClassifier(n_neighbors = 4)
newmodel.fit(tokens, y_train)
pred = newmodel.predict(tokensTest)
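# Score the KNN predictions in `pred`. The recorded output below came from
# comparing `predicted` (the LinearSVC predictions) instead.
np.mean(pred == y_test)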
Out[51]:
1.0
In [52]:
y_test
Out[52]:
array([0, 0, 2, ..., 2, 4, 2])
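
A single accuracy number hides per-class behaviour. classification_report, imported at the top of the notebook, gives precision and recall for each category. The snippet below is a sketch on made-up labels and predictions; in the notebook, y_test and pred would take their place.

```python
from sklearn.metrics import classification_report

# Made-up true labels and predictions, for illustration only.
y_true_demo = [0, 1, 2, 3, 4, 2, 0, 4]
y_pred_demo = [0, 1, 2, 3, 4, 2, 0, 1]

# Names in the order of the integer codes 0 to 4.
target_names = [
    "Business Skills",
    "Disruptive Tech Skills",
    "Soft Skills",
    "Specialized Industry Skills",
    "Tech Skills",
]

report = classification_report(y_true_demo, y_pred_demo, target_names=target_names)
print(report)
```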

Great

Test on unseen Data

In [63]:
SkillName = "Python"
In [64]:
_instance = BertTokenizer(text=[SkillName])
tokens_ = _instance.get()

In [65]:
tokens_.shape
Out[65]:
(1, 768)
In [66]:
pred = newmodel.predict(tokens_)
In [67]:
list(pred)
Out[67]:
[1]
In [68]:
v[list(pred)[0]]
Out[68]:
'Disruptive Tech Skills'
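
The three steps above (embed, predict, decode) can be wrapped in one helper. The snippet below is a sketch only: toy_embed is a hypothetical stand-in that counts letters instead of calling DistilBERT, and predict_category is an illustrative name, so the example runs without downloading a model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder


def toy_embed(texts):
    """Hypothetical stand-in for the DistilBERT features: letter counts per text."""
    vectors = np.zeros((len(texts), 26))
    for row, text in enumerate(texts):
        for ch in text.lower():
            if "a" <= ch <= "z":
                vectors[row, ord(ch) - ord("a")] += 1
    return vectors


def predict_category(skill_name, embed, model, encoder):
    """Embed one skill name, predict its class id, and decode it to a label."""
    features = embed([skill_name])
    class_id = model.predict(features)
    return str(encoder.inverse_transform(class_id)[0])


# Two rows taken from the df.head(2) output shown earlier.
names = ["Mining", "Negotiation"]
labels = ["Specialized Industry Skills", "Soft Skills"]

enc = LabelEncoder()
knn = KNeighborsClassifier(n_neighbors=1).fit(toy_embed(names), enc.fit_transform(labels))

result = predict_category("Negotiation", toy_embed, knn, enc)
print(result)
```

In the notebook, embed would be a function that builds the DistilBERT features, and model and encoder would be newmodel and the fitted LabelEncoder.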

Wooohh !!!!
