Entity Recognition Extract information from Job posting¶

Soumil Nitin Shah¶

Bachelor in Electronic Engineering | Masters in Electrical Engineering | Master in Computer Engineering |

Website : https://soumilshah.herokuapp.com
Github: https://github.com/soumilshah1995
Linkedin: https://www.linkedin.com/in/shah-soumil/
Blog: https://soumilshah1995.blogspot.com/
Youtube : https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw?view_as=subscriber
Facebook Page : https://www.facebook.com/soumilshah1995/
Email : shahsoumil519@gmail.com
projects : https://soumilshah.herokuapp.com/project

Excellent experience of building scalable and high-performance Software Applications combining distinctive skill sets in Internet of Things (IoT), Machine Learning and Full Stack Web Development in Python.

Dataset download link https://www.kaggle.com/madhab/jobposts ¶

Step 1:¶

define imports

import pandas as pd
import os
import sys
import re
import ast
from ast import literal_eval
import plac
import random
from pathlib import Path
import spacy
import re
from spacy.util import minibatch, compounding
import seaborn as sns

df = pd.read_csv("data job posts.csv")

sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')

<matplotlib.axes._subplots.AxesSubplot at 0x21fa75601f0>

df.head(1)

df["Title"].nunique()

8636

df.describe()

there are 8636 uniqe job title in dataset

Step 2:¶

Data Exploration

df.tail(2)

Training Data¶

class TrainDataGenerator(object):
    def __init__(self, text):
        self.text = text
        self.entities = []

    def add_entity(self, searchTerm = '', entity_name=''):
        try:
            response = re.search(searchTerm, self.text)
            data_entity = (response.start(), response.end(), entity_name)
            self.entities.append(data_entity)
        except Exception as e:pass

    def complete_entity(self):

        entity_tem = {"entities":self.entities}
        data = (self.text, entity_tem)

        entites = entity_tem.get("entities")

        # check if entity first index is overlapping with another one
        for i in range(0, len(entites)):
            for j in range(i+1, len(entites)):


                StartIndex1 = entites[i][0]
                endIndex1 = entites[i][1]

                StartIndex2 = entites[j][0]
                endIndex2 = entites[j][1]

                if StartIndex1 in range(StartIndex2, endIndex2):
                    return False
                if endIndex2 in range(StartIndex2, endIndex2):
                    return False

        return data

TRAIN_DATA = []

df.columns

Index(['jobpost', 'date', 'Title', 'Company', 'AnnouncementCode', 'Term',
       'Eligibility', 'Audience', 'StartDate', 'Duration', 'Location',
       'JobDescription', 'JobRequirment', 'RequiredQual', 'Salary',
       'ApplicationP', 'OpeningDate', 'Deadline', 'Notes', 'AboutC', 'Attach',
       'Year', 'Month', 'IT'],
      dtype='object')

for x in df[["jobpost", "Title", "Company", "Salary", "Eligibility", "Duration"]].iterrows():
    
    text = x[1].jobpost
        
    _helper = TrainDataGenerator(text=text)
    
    # Title Training Data
    _helper.add_entity(entity_name='Title', searchTerm= x[1].Title)
    
    # Company Training data 
    _helper.add_entity(entity_name='Company', searchTerm=x[1].Company)
    
    
    # Salary Training data 
    _helper.add_entity(entity_name='Salary', searchTerm=x[1].Salary)
    
    # Eligibility Training data 
    _helper.add_entity(entity_name='Eligibility', searchTerm=x[1].Eligibility)

    # Duration Training data 
    _helper.add_entity(entity_name='Eligibility', searchTerm=x[1].Duration)
    
    response = _helper.complete_entity()
    if response == False:
        pass
    else:
        TRAIN_DATA.append(response)

len(TRAIN_DATA)

18993

TRAIN_DATA = TRAIN_DATA[::100]

len(TRAIN_DATA)

190

Model¶

class Model(object):

    def __init__(self, modelName = "testmodel"):
        self.nlp = spacy.blank("en")
        self.modelName = modelName

    def train(self, output_dir=None, n_iter=80):

        # create the built-in pipeline components and add them to the pipeline
        # nlp.create_pipe works for built-ins that are registered with spaCy
        if "ner" not in self.nlp.pipe_names:
            ner = self.nlp.create_pipe("ner")
            self.nlp.add_pipe(ner, last=True)

        # otherwise, get it so we can add labels
        else:
            ner = self.nlp.get_pipe("ner")

        # add labels
        for _, annotations in TRAIN_DATA:
            for ent in annotations.get("entities"):
                ner.add_label(ent[2])

        # get names of other pipes to disable them during training
        pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
        other_pipes = [pipe
                       for pipe in self.nlp.pipe_names if pipe not in pipe_exceptions]

        with self.nlp.disable_pipes(*other_pipes):  # only train NER
            # reset and initialize the weights randomly – but only if we're
            # training a new model
            self.nlp.begin_training()

            for itn in range(n_iter):
                print("Iteration : {} ".format(itn))
                random.shuffle(TRAIN_DATA)

                losses = {}
                # batch up the examples using spaCy's minibatch

                batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))

                for batch in batches:
                    texts, annotations = zip(*batch)
                    self.nlp.update(
                        texts,              # batch of texts
                        annotations,        # batch of annotations
                        drop=0.5,           # dropout - make it harder to memorise data
                        losses=losses,
                    )

                print("Losses", losses)

            self.nlp.to_disk(self.modelName)
            print('Model has been trained and saved on your disk ')
            print("use nlp = spacy.load(NAME) ")
            print("\n")

def main():
    train = Model(modelName='jobposting')
    response = train.train()

main()

c:\python38\lib\site-packages\spacy\language.py:635: UserWarning: [W033] Training a new parser or NER using a model with no lexeme normalization table. This may degrade the performance of the model to some degree. If this is intentional or the language you're using doesn't have a normalization table, please ignore this warning. If this is surprising, make sure you have the spacy-lookups-data package installed. The languages with lexeme normalization tables are currently: da, de, el, en, id, lb, pt, ru, sr, ta, th.
  proc.begin_training(

Iteration : 0

c:\python38\lib\site-packages\spacy\language.py:482: UserWarning: [W030] Some entities could not be aligned in the text ""Motion Time" Advertising Company
TITLE:  Adverti..." with entities "[(43, 60, 'Title'), (0, 33, 'Company'), (513, 517,...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
  gold = GoldParse(doc, **gold)

Losses {'ner': 18380.97104827072}
Iteration : 1 
Losses {'ner': 4699.670381930318}
Iteration : 2 
Losses {'ner': 9878.842687040102}
Iteration : 3 
Losses {'ner': 9121.51499587028}
Iteration : 4 
Losses {'ner': 11544.020995071827}
Iteration : 5 
Losses {'ner': 10610.450039949035}
Iteration : 6 
Losses {'ner': 10369.691868495196}
Iteration : 7 
Losses {'ner': 8570.033112825418}
Iteration : 8 
Losses {'ner': 9381.052244767467}
Iteration : 9 
Losses {'ner': 7903.03582038579}
Iteration : 10 
Losses {'ner': 9444.623359981575}
Iteration : 11 
Losses {'ner': 8748.988276667384}
Iteration : 12 
Losses {'ner': 7541.176756621659}
Iteration : 13 
Losses {'ner': 6832.486853603837}
Iteration : 14 
Losses {'ner': 6947.726805507671}
Iteration : 15 
Losses {'ner': 5296.005261249757}
Iteration : 16 
Losses {'ner': 6297.691770431446}
Iteration : 17 
Losses {'ner': 3835.2003134659217}
Iteration : 18 
Losses {'ner': 4093.5110076406318}
Iteration : 19 
Losses {'ner': 3160.8572605272157}
Iteration : 20 
Losses {'ner': 3101.293733747376}
Iteration : 21 
Losses {'ner': 3007.217087266718}
Iteration : 22 
Losses {'ner': 3091.8537776147673}
Iteration : 23 
Losses {'ner': 2713.7856922169394}
Iteration : 24 
Losses {'ner': 3289.6989442914437}
Iteration : 25 
Losses {'ner': 2615.1358862667}
Iteration : 26 
Losses {'ner': 2200.379158669518}
Iteration : 27 
Losses {'ner': 1675.5066180808324}
Iteration : 28 
Losses {'ner': 1679.2881498131746}
Iteration : 29 
Losses {'ner': 2062.0508638124707}
Iteration : 30 
Losses {'ner': 1573.6250076905765}
Iteration : 31 
Losses {'ner': 1345.485321863029}
Iteration : 32 
Losses {'ner': 1594.848346285836}
Iteration : 33 
Losses {'ner': 2043.3384764484704}
Iteration : 34 
Losses {'ner': 2064.703963995858}
Iteration : 35 
Losses {'ner': 1136.830656361065}
Iteration : 36 
Losses {'ner': 1776.7798312752557}
Iteration : 37 
Losses {'ner': 1267.1696598067747}
Iteration : 38 
Losses {'ner': 984.2493475861796}
Iteration : 39 
Losses {'ner': 606.7264764451215}
Iteration : 40 
Losses {'ner': 756.7257394838367}
Iteration : 41 
Losses {'ner': 617.7859206620932}
Iteration : 42 
Losses {'ner': 673.6303049232835}
Iteration : 43 
Losses {'ner': 456.09631514137226}
Iteration : 44 
Losses {'ner': 1205.5190171253353}
Iteration : 45 
Losses {'ner': 662.2341710073242}
Iteration : 46 
Losses {'ner': 507.36188621454585}
Iteration : 47 
Losses {'ner': 610.4899416853752}
Iteration : 48 
Losses {'ner': 877.5788984592551}
Iteration : 49 
Losses {'ner': 486.9110064183544}
Iteration : 50 
Losses {'ner': 1130.0018569109063}
Iteration : 51 
Losses {'ner': 460.45876164825495}
Iteration : 52 
Losses {'ner': 590.1637958558758}
Iteration : 53 
Losses {'ner': 313.933794949397}
Iteration : 54 
Losses {'ner': 489.32273312699454}
Iteration : 55 
Losses {'ner': 341.2340635119493}
Iteration : 56 
Losses {'ner': 355.16430260488954}
Iteration : 57 
Losses {'ner': 250.98433943584354}
Iteration : 58 
Losses {'ner': 476.87904149085074}
Iteration : 59 
Losses {'ner': 399.2463960835315}
Iteration : 60 
Losses {'ner': 583.6334809564943}
Iteration : 61 
Losses {'ner': 391.8574076966954}
Iteration : 62 
Losses {'ner': 523.1551393680376}
Iteration : 63 
Losses {'ner': 334.6856271845277}
Iteration : 64 
Losses {'ner': 316.45389223542685}
Iteration : 65 
Losses {'ner': 414.64109182738804}
Iteration : 66 
Losses {'ner': 412.9363883212921}
Iteration : 67 
Losses {'ner': 211.245532639828}
Iteration : 68 
Losses {'ner': 212.5565848767217}
Iteration : 69 
Losses {'ner': 228.1667984008269}
Iteration : 70 
Losses {'ner': 262.8385774929472}
Iteration : 71 
Losses {'ner': 373.82818237721585}
Iteration : 72 
Losses {'ner': 159.02358136375446}
Iteration : 73 
Losses {'ner': 206.6122921309186}
Iteration : 74 
Losses {'ner': 380.0435376501302}
Iteration : 75 
Losses {'ner': 152.6402089288273}
Iteration : 76 
Losses {'ner': 283.5626839841529}
Iteration : 77 
Losses {'ner': 119.18966887420726}
Iteration : 78 
Losses {'ner': 129.56188815885506}
Iteration : 79 
Losses {'ner': 202.87337789039807}
Model has been trained and saved on your disk 
use nlp = spacy.load(NAME)

Test¶

nlp = spacy.load("jobposting")

text = """
'Aldo\r\nTITLE:  Retail Merchandiser\r\nSTART DATE/ TIME:  Immediate employment\r\nDURATION:  Long-term\r\nLOCATION:  Yerevan, Armenia\r\nJOB DESCRIPTION:  Aldo is seeking a Retail Merchandiser to drive maximum\r\nprofitability through planning stock intake to meet budgeted sales, build\r\nrelationships and work effectively with the host brand teams.\r\nJOB RESPONSIBILITIES:\r\n- Maximize and achieve revenue and profitability targets through\r\neffective merchandise planning and selection, product strategy and\r\nplanning, pricing, promotions, inventory control and vendor management;\r\n- Manage the buying budget and process, product pricing and margin\r\nmanagement;\r\n- Control the stock management and flow planning of all incoming product\r\nlines against monthly and annual budgets that includes for building and\r\nsupporting commercial strategies and trading plans;\r\n- Create seasonal sales and buying plans in order to maximize commercial\r\nopportunity and which meets profit objectives;\r\n- Work closely with marketing and operations teams to develop\r\nadvertisement, and sales promotions as well as arrangement of product\r\ncategories to adjust store inventory levels;\r\n- Supervise instore Visual Presentation, re-layout and re-merchandising;\r\n- Responsible for in-store visual merchandisers development and talent\r\nmanagement in order to achieve merchandising business objectives.\r\nREQUIRED QUALIFICATIONS:\r\n- University Degree in Business Administration, Finance or Marketing; \r\n- At least 2 years of work experience in financial analytics,\r\nmerchandising or product management;\r\n- Relevant work experience in retail organization or environment would be\r\nan added advantage;\r\n- Excellent verbal and written communication skills in English and\r\nArmenian languages; \r\n- Proven ability to motivate others; \r\n- Excellent analytical and numerical skill;\r\n- Strong entrepreneurial spirit with a passionate commitment to the\r\ncustomer and product quality;\r\n- Strong team player with good people management and strong leadership\r\nqualities with the ability to work with people of all levels;\r\n- Willingness to travel occasionally;\r\n- PC literacy.\r\nREMUNERATION/ SALARY:  Highly competitive\r\nAPPLICATION PROCEDURES:  Interested candidates are encouraged to submit a\r\nCV to: hr.franchise@... with a note of " Retail Merchandiser " in\r\nthe subject line or call 52 57 22 for inquiries. The Group thanks all who\r\nexpress interest in this opportunity; however only those selected for an\r\ninterview will be contacted. Applications privacy and confidentiality are\r\nguaranteed.\r\nPlease clearly mention in your application letter that you learned of\r\nthis job opportunity through Career Center and mention the URL of its\r\nwebsite - www.careercenter.am, Thanks.\r\nOPENING DATE:  30 July 2012\r\nAPPLICATION DEADLINE:  29 August 2012\r\n----------------------------------\r\nTo place a free posting for job or other career-related opportunities\r\navailable in your organization, just go to the www.careercenter.am\r\nwebsite and follow the "Post an Announcement" link.'"""

doc = nlp(text)

data = [{ent.label_: ent.text}   for ent in doc.ents]

data

[{'Company': 'Aldo'},
 {'Title': 'Retail Merchandiser'},
 {'Eligibility': 'Long-term'},
 {'Salary': 'Highly competitive'}]

	Year	Month
count	19001.000000	19001.000000
mean	2010.274722	6.493869
std	3.315609	3.405503
min	2004.000000	1.000000
25%	2008.000000	3.000000
50%	2011.000000	7.000000
75%	2013.000000	9.000000
max	2015.000000	12.000000

	jobpost	date	Title	Company	AnnouncementCode	Term	Eligibility	Audience	StartDate	Duration	...	Salary	ApplicationP	OpeningDate	Deadline	Notes	AboutC	Attach	Year	Month	IT
18999	San Lazzaro LLC\r\n\r\n\r\nTITLE: Head of O...	Dec 30, 2015	Head of Online Sales Department	San Lazzaro LLC	NaN	NaN	NaN	NaN	NaN	Long-term	...	Highly competitive	Interested candidates can send their CVs to:\r...	30 December 2015	29 January 2016	NaN	San Lazzaro LLC works with several internation...	NaN	2015	12	False
19000	"Kamurj" UCO CJSC\r\n\r\n\r\nTITLE: Lawyer in...	Dec 30, 2015	Lawyer in Legal Department	"Kamurj" UCO CJSC	NaN	Full-time	NaN	NaN	NaN	Indefinite	...	NaN	All qualified applicants are encouraged to\r\n...	30 December 2015	20 January 2016	NaN	"Kamurj" UCO CJSC is providing micro and small...	NaN	2015	12	False

Pythonist

Sunday, April 4, 2021

Entity Recognition Extract information from Job posting using Spacy Machine learning Model

Entity Recognition Extract information from Job posting¶

Soumil Nitin Shah¶

Dataset download link https://www.kaggle.com/madhab/jobposts ¶

Step 1:¶

Step 2:¶

Training Data¶

Model¶

Test¶

2 comments:

SPJ Joins in Iceberg how to use them | Faster Join Avoid Shuffle

Sunday, April 4, 2021

Entity Recognition Extract information from Job posting using Spacy Machine learning Model

Entity Recognition Extract information from Job posting¶

Soumil Nitin Shah¶

Dataset download link https://www.kaggle.com/madhab/jobposts¶

Step 1:¶

Step 2:¶

Training Data¶

Model¶

Test¶

2 comments:

SPJ Joins in Iceberg how to use them | Faster Join Avoid Shuffle

Dataset download link https://www.kaggle.com/madhab/jobposts ¶