Sunday, April 4, 2021

Entity Recognition Extract information from Job posting using Spacy Machine learning Model

Untitled

Entity Recognition Extract information from Job posting

Soumil Nitin Shah

Bachelor in Electronic Engineering | Masters in Electrical Engineering | Master in Computer Engineering |

Excellent experience of building scalable and high-performance Software Applications combining distinctive skill sets in Internet of Things (IoT), Machine Learning and Full Stack Web Development in Python.

Step 1:

  • define imports
In [228]:
import pandas as pd
import os
import sys
import re
import ast
from ast import literal_eval
import plac
import random
from pathlib import Path
import spacy
import re
from spacy.util import minibatch, compounding
import seaborn as sns
In [192]:
df = pd.read_csv("data job posts.csv")
In [217]:
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[217]:
<matplotlib.axes._subplots.AxesSubplot at 0x21fa75601f0>
In [220]:
df.head(1)
Out[220]:
jobpost date Title Company AnnouncementCode Term Eligibility Audience StartDate Duration ... Salary ApplicationP OpeningDate Deadline Notes AboutC Attach Year Month IT
0 AMERIA Investment Consulting Company\r\nJOB TI... Jan 5, 2004 Chief Financial Officer AMERIA Investment Consulting Company NaN NaN NaN NaN NaN NaN ... NaN To apply for this position, please submit a\r\... NaN 26 January 2004 NaN NaN NaN 2004 1 False

1 rows × 24 columns

In [223]:
df["Title"].nunique()
Out[223]:
8636
In [224]:
df.describe()
Out[224]:
Year Month
count 19001.000000 19001.000000
mean 2010.274722 6.493869
std 3.315609 3.405503
min 2004.000000 1.000000
25% 2008.000000 3.000000
50% 2011.000000 7.000000
75% 2013.000000 9.000000
max 2015.000000 12.000000
  • there are 8636 uniqe job title in dataset

Step 2:

  • Data Exploration
In [193]:
df.tail(2)
Out[193]:
jobpost date Title Company AnnouncementCode Term Eligibility Audience StartDate Duration ... Salary ApplicationP OpeningDate Deadline Notes AboutC Attach Year Month IT
18999 San Lazzaro LLC\r\n\r\n\r\nTITLE: Head of O... Dec 30, 2015 Head of Online Sales Department San Lazzaro LLC NaN NaN NaN NaN NaN Long-term ... Highly competitive Interested candidates can send their CVs to:\r... 30 December 2015 29 January 2016 NaN San Lazzaro LLC works with several internation... NaN 2015 12 False
19000 "Kamurj" UCO CJSC\r\n\r\n\r\nTITLE: Lawyer in... Dec 30, 2015 Lawyer in Legal Department "Kamurj" UCO CJSC NaN Full-time NaN NaN NaN Indefinite ... NaN All qualified applicants are encouraged to\r\n... 30 December 2015 20 January 2016 NaN "Kamurj" UCO CJSC is providing micro and small... NaN 2015 12 False

2 rows × 24 columns

Training Data

In [194]:
class TrainDataGenerator(object):
    def __init__(self, text):
        self.text = text
        self.entities = []

    def add_entity(self, searchTerm = '', entity_name=''):
        try:
            response = re.search(searchTerm, self.text)
            data_entity = (response.start(), response.end(), entity_name)
            self.entities.append(data_entity)
        except Exception as e:pass

    def complete_entity(self):

        entity_tem = {"entities":self.entities}
        data = (self.text, entity_tem)

        entites = entity_tem.get("entities")

        # check if entity first index is overlapping with another one
        for i in range(0, len(entites)):
            for j in range(i+1, len(entites)):


                StartIndex1 = entites[i][0]
                endIndex1 = entites[i][1]

                StartIndex2 = entites[j][0]
                endIndex2 = entites[j][1]

                if StartIndex1 in range(StartIndex2, endIndex2):
                    return False
                if endIndex2 in range(StartIndex2, endIndex2):
                    return False

        return data
In [195]:
TRAIN_DATA = []
In [196]:
df.columns
Out[196]:
Index(['jobpost', 'date', 'Title', 'Company', 'AnnouncementCode', 'Term',
       'Eligibility', 'Audience', 'StartDate', 'Duration', 'Location',
       'JobDescription', 'JobRequirment', 'RequiredQual', 'Salary',
       'ApplicationP', 'OpeningDate', 'Deadline', 'Notes', 'AboutC', 'Attach',
       'Year', 'Month', 'IT'],
      dtype='object')
In [197]:
for x in df[["jobpost", "Title", "Company", "Salary", "Eligibility", "Duration"]].iterrows():
    
    text = x[1].jobpost
        
    _helper = TrainDataGenerator(text=text)
    
    # Title Training Data
    _helper.add_entity(entity_name='Title', searchTerm= x[1].Title)
    
    # Company Training data 
    _helper.add_entity(entity_name='Company', searchTerm=x[1].Company)
    
    
    # Salary Training data 
    _helper.add_entity(entity_name='Salary', searchTerm=x[1].Salary)
    
    # Eligibility Training data 
    _helper.add_entity(entity_name='Eligibility', searchTerm=x[1].Eligibility)

    # Duration Training data 
    _helper.add_entity(entity_name='Eligibility', searchTerm=x[1].Duration)
    
    response = _helper.complete_entity()
    if response == False:
        pass
    else:
        TRAIN_DATA.append(response)
In [198]:
len(TRAIN_DATA)
Out[198]:
18993
In [199]:
TRAIN_DATA = TRAIN_DATA[::100]
In [200]:
len(TRAIN_DATA)
Out[200]:
190

Model

In [201]:
class Model(object):

    def __init__(self, modelName = "testmodel"):
        self.nlp = spacy.blank("en")
        self.modelName = modelName

    def train(self, output_dir=None, n_iter=80):

        # create the built-in pipeline components and add them to the pipeline
        # nlp.create_pipe works for built-ins that are registered with spaCy
        if "ner" not in self.nlp.pipe_names:
            ner = self.nlp.create_pipe("ner")
            self.nlp.add_pipe(ner, last=True)

        # otherwise, get it so we can add labels
        else:
            ner = self.nlp.get_pipe("ner")

        # add labels
        for _, annotations in TRAIN_DATA:
            for ent in annotations.get("entities"):
                ner.add_label(ent[2])

        # get names of other pipes to disable them during training
        pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
        other_pipes = [pipe
                       for pipe in self.nlp.pipe_names if pipe not in pipe_exceptions]

        with self.nlp.disable_pipes(*other_pipes):  # only train NER
            # reset and initialize the weights randomly – but only if we're
            # training a new model
            self.nlp.begin_training()

            for itn in range(n_iter):
                print("Iteration : {} ".format(itn))
                random.shuffle(TRAIN_DATA)

                losses = {}
                # batch up the examples using spaCy's minibatch

                batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))

                for batch in batches:
                    texts, annotations = zip(*batch)
                    self.nlp.update(
                        texts,              # batch of texts
                        annotations,        # batch of annotations
                        drop=0.5,           # dropout - make it harder to memorise data
                        losses=losses,
                    )

                print("Losses", losses)

            self.nlp.to_disk(self.modelName)
            print('Model has been trained and saved on your disk ')
            print("use nlp = spacy.load(NAME) ")
            print("\n")
In [202]:
def main():
    train = Model(modelName='jobposting')
    response = train.train()
In [203]:
main()
c:\python38\lib\site-packages\spacy\language.py:635: UserWarning: [W033] Training a new parser or NER using a model with no lexeme normalization table. This may degrade the performance of the model to some degree. If this is intentional or the language you're using doesn't have a normalization table, please ignore this warning. If this is surprising, make sure you have the spacy-lookups-data package installed. The languages with lexeme normalization tables are currently: da, de, el, en, id, lb, pt, ru, sr, ta, th.
  proc.begin_training(
Iteration : 0 
c:\python38\lib\site-packages\spacy\language.py:482: UserWarning: [W030] Some entities could not be aligned in the text ""Motion Time" Advertising Company
TITLE:  Adverti..." with entities "[(43, 60, 'Title'), (0, 33, 'Company'), (513, 517,...". Use `spacy.gold.biluo_tags_from_offsets(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
  gold = GoldParse(doc, **gold)
Losses {'ner': 18380.97104827072}
Iteration : 1 
Losses {'ner': 4699.670381930318}
Iteration : 2 
Losses {'ner': 9878.842687040102}
Iteration : 3 
Losses {'ner': 9121.51499587028}
Iteration : 4 
Losses {'ner': 11544.020995071827}
Iteration : 5 
Losses {'ner': 10610.450039949035}
Iteration : 6 
Losses {'ner': 10369.691868495196}
Iteration : 7 
Losses {'ner': 8570.033112825418}
Iteration : 8 
Losses {'ner': 9381.052244767467}
Iteration : 9 
Losses {'ner': 7903.03582038579}
Iteration : 10 
Losses {'ner': 9444.623359981575}
Iteration : 11 
Losses {'ner': 8748.988276667384}
Iteration : 12 
Losses {'ner': 7541.176756621659}
Iteration : 13 
Losses {'ner': 6832.486853603837}
Iteration : 14 
Losses {'ner': 6947.726805507671}
Iteration : 15 
Losses {'ner': 5296.005261249757}
Iteration : 16 
Losses {'ner': 6297.691770431446}
Iteration : 17 
Losses {'ner': 3835.2003134659217}
Iteration : 18 
Losses {'ner': 4093.5110076406318}
Iteration : 19 
Losses {'ner': 3160.8572605272157}
Iteration : 20 
Losses {'ner': 3101.293733747376}
Iteration : 21 
Losses {'ner': 3007.217087266718}
Iteration : 22 
Losses {'ner': 3091.8537776147673}
Iteration : 23 
Losses {'ner': 2713.7856922169394}
Iteration : 24 
Losses {'ner': 3289.6989442914437}
Iteration : 25 
Losses {'ner': 2615.1358862667}
Iteration : 26 
Losses {'ner': 2200.379158669518}
Iteration : 27 
Losses {'ner': 1675.5066180808324}
Iteration : 28 
Losses {'ner': 1679.2881498131746}
Iteration : 29 
Losses {'ner': 2062.0508638124707}
Iteration : 30 
Losses {'ner': 1573.6250076905765}
Iteration : 31 
Losses {'ner': 1345.485321863029}
Iteration : 32 
Losses {'ner': 1594.848346285836}
Iteration : 33 
Losses {'ner': 2043.3384764484704}
Iteration : 34 
Losses {'ner': 2064.703963995858}
Iteration : 35 
Losses {'ner': 1136.830656361065}
Iteration : 36 
Losses {'ner': 1776.7798312752557}
Iteration : 37 
Losses {'ner': 1267.1696598067747}
Iteration : 38 
Losses {'ner': 984.2493475861796}
Iteration : 39 
Losses {'ner': 606.7264764451215}
Iteration : 40 
Losses {'ner': 756.7257394838367}
Iteration : 41 
Losses {'ner': 617.7859206620932}
Iteration : 42 
Losses {'ner': 673.6303049232835}
Iteration : 43 
Losses {'ner': 456.09631514137226}
Iteration : 44 
Losses {'ner': 1205.5190171253353}
Iteration : 45 
Losses {'ner': 662.2341710073242}
Iteration : 46 
Losses {'ner': 507.36188621454585}
Iteration : 47 
Losses {'ner': 610.4899416853752}
Iteration : 48 
Losses {'ner': 877.5788984592551}
Iteration : 49 
Losses {'ner': 486.9110064183544}
Iteration : 50 
Losses {'ner': 1130.0018569109063}
Iteration : 51 
Losses {'ner': 460.45876164825495}
Iteration : 52 
Losses {'ner': 590.1637958558758}
Iteration : 53 
Losses {'ner': 313.933794949397}
Iteration : 54 
Losses {'ner': 489.32273312699454}
Iteration : 55 
Losses {'ner': 341.2340635119493}
Iteration : 56 
Losses {'ner': 355.16430260488954}
Iteration : 57 
Losses {'ner': 250.98433943584354}
Iteration : 58 
Losses {'ner': 476.87904149085074}
Iteration : 59 
Losses {'ner': 399.2463960835315}
Iteration : 60 
Losses {'ner': 583.6334809564943}
Iteration : 61 
Losses {'ner': 391.8574076966954}
Iteration : 62 
Losses {'ner': 523.1551393680376}
Iteration : 63 
Losses {'ner': 334.6856271845277}
Iteration : 64 
Losses {'ner': 316.45389223542685}
Iteration : 65 
Losses {'ner': 414.64109182738804}
Iteration : 66 
Losses {'ner': 412.9363883212921}
Iteration : 67 
Losses {'ner': 211.245532639828}
Iteration : 68 
Losses {'ner': 212.5565848767217}
Iteration : 69 
Losses {'ner': 228.1667984008269}
Iteration : 70 
Losses {'ner': 262.8385774929472}
Iteration : 71 
Losses {'ner': 373.82818237721585}
Iteration : 72 
Losses {'ner': 159.02358136375446}
Iteration : 73 
Losses {'ner': 206.6122921309186}
Iteration : 74 
Losses {'ner': 380.0435376501302}
Iteration : 75 
Losses {'ner': 152.6402089288273}
Iteration : 76 
Losses {'ner': 283.5626839841529}
Iteration : 77 
Losses {'ner': 119.18966887420726}
Iteration : 78 
Losses {'ner': 129.56188815885506}
Iteration : 79 
Losses {'ner': 202.87337789039807}
Model has been trained and saved on your disk 
use nlp = spacy.load(NAME) 


Test

In [205]:
nlp = spacy.load("jobposting")
In [213]:
text = """
'Aldo\r\nTITLE:  Retail Merchandiser\r\nSTART DATE/ TIME:  Immediate employment\r\nDURATION:  Long-term\r\nLOCATION:  Yerevan, Armenia\r\nJOB DESCRIPTION:  Aldo is seeking a Retail Merchandiser to drive maximum\r\nprofitability through planning stock intake to meet budgeted sales, build\r\nrelationships and work effectively with the host brand teams.\r\nJOB RESPONSIBILITIES:\r\n- Maximize and achieve revenue and profitability targets through\r\neffective merchandise planning and selection, product strategy and\r\nplanning, pricing, promotions, inventory control and vendor management;\r\n- Manage the buying budget and process, product pricing and margin\r\nmanagement;\r\n- Control the stock management and flow planning of all incoming product\r\nlines against monthly and annual budgets that includes for building and\r\nsupporting commercial strategies and trading plans;\r\n- Create seasonal sales and buying plans in order to maximize commercial\r\nopportunity and which meets profit objectives;\r\n- Work closely with marketing and operations teams to develop\r\nadvertisement, and sales promotions as well as arrangement of product\r\ncategories to adjust store inventory levels;\r\n- Supervise instore Visual Presentation, re-layout and re-merchandising;\r\n- Responsible for in-store visual merchandisers development and talent\r\nmanagement in order to achieve merchandising business objectives.\r\nREQUIRED QUALIFICATIONS:\r\n- University Degree in Business Administration, Finance or Marketing; \r\n- At least 2 years of work experience in financial analytics,\r\nmerchandising or product management;\r\n- Relevant work experience in retail organization or environment would be\r\nan added advantage;\r\n- Excellent verbal and written communication skills in English and\r\nArmenian languages; \r\n- Proven ability to motivate others; \r\n- Excellent analytical and numerical skill;\r\n- Strong entrepreneurial spirit with a passionate commitment to the\r\ncustomer and product quality;\r\n- Strong team player with good people management and strong leadership\r\nqualities with the ability to work with people of all levels;\r\n- Willingness to travel occasionally;\r\n- PC literacy.\r\nREMUNERATION/ SALARY:  Highly competitive\r\nAPPLICATION PROCEDURES:  Interested candidates are encouraged to submit a\r\nCV to: hr.franchise@... with a note of " Retail Merchandiser " in\r\nthe subject line or call 52 57 22 for inquiries. The Group thanks all who\r\nexpress interest in this opportunity; however only those selected for an\r\ninterview will be contacted. Applications privacy and confidentiality are\r\nguaranteed.\r\nPlease clearly mention in your application letter that you learned of\r\nthis job opportunity through Career Center and mention the URL of its\r\nwebsite - www.careercenter.am, Thanks.\r\nOPENING DATE:  30 July 2012\r\nAPPLICATION DEADLINE:  29 August 2012\r\n----------------------------------\r\nTo place a free posting for job or other career-related opportunities\r\navailable in your organization, just go to the www.careercenter.am\r\nwebsite and follow the "Post an Announcement" link.'"""
In [226]:
doc = nlp(text)

data = [{ent.label_: ent.text}   for ent in doc.ents]
In [227]:
data
Out[227]:
[{'Company': 'Aldo'},
 {'Title': 'Retail Merchandiser'},
 {'Eligibility': 'Long-term'},
 {'Salary': 'Highly competitive'}]

2 comments:

  1. ---------------------------------------------------------------------------
    ValueError Traceback (most recent call last)
    ~\AppData\Local\Temp/ipykernel_20000/1817008154.py in
    3 response = train.train()
    4
    ----> 5 main()

    ~\AppData\Local\Temp/ipykernel_20000/1817008154.py in main()
    1 def main():
    2 train = Model(modelName='jobposting')
    ----> 3 response = train.train()
    4
    5 main()

    ~\AppData\Local\Temp/ipykernel_20000/4253876312.py in train(self, output_dir, n_iter)
    11 if "ner" not in self.nlp.pipe_names:
    12 ner = self.nlp.create_pipe("ner")
    ---> 13 self.nlp.add_pipe(ner, last=True)
    14
    15

    C:\Python38\lib\site-packages\spacy\language.py in add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    753 bad_val = repr(factory_name)
    754 err = Errors.E966.format(component=bad_val, name=name)
    --> 755 raise ValueError(err)
    756 name = name if name is not None else factory_name
    757 if name in self.component_names:

    ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got (name: 'None').

    - If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.

    - If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.

    - If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

    ReplyDelete
  2. Pythonist: Entity Recognition Extract Information From Job Posting Using Spacy Machine Learning Model >>>>> Download Now

    >>>>> Download Full

    Pythonist: Entity Recognition Extract Information From Job Posting Using Spacy Machine Learning Model >>>>> Download LINK

    >>>>> Download Now

    Pythonist: Entity Recognition Extract Information From Job Posting Using Spacy Machine Learning Model >>>>> Download Full

    >>>>> Download LINK

    ReplyDelete

Develop Full Text Search (Semantics Search) with Postgres (PGVector) and Python Hands on Lab

final-notebook Develop Full Text Search (Semantics Search) with Postgres (PGVector)...