Sunday, January 23, 2022

Power of Semantics Search combined with Elastic Search | ML on ELK

Soumil Nitin Shah

Bachelor's in Electronic Engineering | Master's in Electrical Engineering | Master's in Computer Engineering



Step 1: Define imports
In [4]:
try:
    import json
    import os
    import uuid

    import pandas as pd
    import numpy as np

    import elasticsearch
    from elasticsearch import Elasticsearch
    from elasticsearch import helpers
    from sentence_transformers import SentenceTransformer, util
    from tqdm import tqdm
    from dotenv import load_dotenv
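except ImportError as e:
    # Fail fast: every later cell depends on these modules.
    raise ImportError("Some modules are missing: {}".format(e)) from e

# Read ENDPOINT (the cluster URL) from the local env file.
load_dotenv("secret.env")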

Step 2: Define helper classes

In [5]:
class Reader(object):

    def __init__(self, file_name):
        self.file_name = file_name

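    def run(self):
        # Read the CSV in chunks and keep only the first chunk (3,000 rows)
        chunks = pd.read_csv(self.file_name, chunksize=3000)
        df = next(chunks)
        # Replace NaN with empty strings so every field can be serialised to JSON
        df = df.fillna("")
        return df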
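  • This class converts the given text into a sentence embedding (a 384-dimensional vector for all-MiniLM-L6-v2); the code refers to these vectors as tokens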
In [6]:
class Tokenizer(object):
    def __init__(self):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')

    def get_token(self, documents):
        # encode() on a one-item list returns an array of shape (1, dim)
        sentence_embeddings = self.model.encode([documents])
        # Flatten to a plain list of Python floats so it can be sent as JSON
        return sentence_embeddings.flatten().tolist()
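Semantic search works by comparing vectors like the ones get_token returns: texts with similar meaning should get embeddings that point in similar directions. Below is a minimal sketch of that comparison using cosine similarity in plain Python. The function name cosine_similarity and the toy three-dimensional vectors are illustrative assumptions (real all-MiniLM-L6-v2 embeddings have 384 dimensions), and the distance metric the search cluster actually applies depends on how the index is configured.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values)
query_vec = [1.0, 0.0, 1.0]
close_vec = [2.0, 0.0, 2.0]   # same direction as the query
far_vec = [0.0, 1.0, 0.0]     # orthogonal to the query

print(cosine_similarity(query_vec, close_vec))
print(cosine_similarity(query_vec, far_vec))
```

The first score should come out at (or within floating-point error of) 1, the second at 0: a vector pointing the same way as the query ranks highest regardless of its length.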
In [7]:
class ElasticSearchImports(object):
    def __init__(self, df, index_name='posting'):
        self.df = df
        self.index_name = index_name
        self.es = Elasticsearch(timeout=600,hosts=os.getenv("ENDPOINT"))

    def run(self):

        elk_data = self.df.to_dict("records")

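        for job in elk_data:
            try:
                self.es.index(index=self.index_name, body=job)
            except Exception as e:
                # Report the failure instead of silently dropping the document
                print("Failed to index document: {}".format(e))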

        return True
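Indexing one document per request is slow for thousands of rows. The helpers module imported in Step 1 provides helpers.bulk, which accepts an iterable of action dictionaries. Below is a minimal sketch of building those actions from the records. build_actions is a hypothetical helper name, and the sample records are made up for illustration.

```python
def build_actions(records, index_name="posting"):
    """Yield one bulk action per record, in the shape helpers.bulk accepts."""
    for record in records:
        yield {
            "_index": index_name,
            "_source": record,
        }


# Made-up records standing in for df.to_dict("records")
sample_records = [
    {"Title": "Web Developer", "vectors": [0.1, 0.2]},
    {"Title": "Web Designer", "vectors": [0.3, 0.4]},
]
actions = list(build_actions(sample_records))
print(len(actions), actions[0]["_index"])

# With a live cluster, the actions could then be sent in batches:
#   from elasticsearch import helpers
#   helpers.bulk(es, build_actions(elk_data))
```

Because build_actions is a generator, the full list of actions never has to sit in memory when it is passed straight to helpers.bulk.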

Step 3: Convert the column to vector embeddings

In [25]:
helper = Reader(file_name="data job posts.csv")
df = helper.run()
In [26]:
tqdm.pandas()
helper_token = Tokenizer()
df["vectors"] = df["jobpost"].progress_apply(helper_token.get_token)
100%|████████████████████████████████████████████████████████████████████████████| 3000/3000 [12:00<00:00,  4.17it/s]
In [70]:
helper_elk = ElasticSearchImports(df=df)
helper_elk.run()
Out[70]:
True

Step 4: Test the search

In [14]:
helper_token = Tokenizer()
INPUT = input("Enter the Input Query ")
token_vector = helper_token.get_token(INPUT)

query ={
  
   "size":50,
   "_source": "Title", 
   "query":{
      "bool":{
         "must":[
            {
               "knn":{
                  "vectors":{
                     "vector":token_vector,
                     "k":20
                  }
               }
            }
         ]
      }
   }
}
es = Elasticsearch(timeout=600, hosts=os.getenv("ENDPOINT"))
res = es.search(index='posting',
                size=50,
                body=query,
                request_timeout=55)

title = [x['_source']  for x in res['hits']['hits']]
title
Enter the Input Query I am looking for someone to make website and Webdeveloper
Out[14]:
[{'Title': 'Web Designer'},
 {'Title': 'Web Developer'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Designer (Independent Contractor)'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Developer/ Programmer'},
 {'Title': 'Web Designer'},
 {'Title': 'Photoshop Graphics Web Designer'},
 {'Title': 'Photoshop Graphics Web Designer'},
 {'Title': 'Web Programmer'},
 {'Title': 'Web designer'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Developer'},
 {'Title': 'Linux Administrator/ Developer'},
 {'Title': 'Web Programmer & Designer'},
 {'Title': 'Web Systems Group Engineer'},
 {'Title': 'PHP Programmer'},
 {'Title': 'Web Developer'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Programmer & Designer'},
 {'Title': 'Web Developer'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Designer/ Developer'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Designer/ Developer'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Designer'},
 {'Title': 'ASP.NET (C#) Web Developer'},
 {'Title': 'Web-Designer'},
 {'Title': 'Web Developer'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Programmer/ Coder'},
 {'Title': 'Web Developer'},
 {'Title': 'WEb Designer'},
 {'Title': 'Web Developer'},
 {'Title': 'Web Programmer'},
 {'Title': 'Designer'},
 {'Title': 'Web Developer'},
 {'Title': 'Web Designer'},
 {'Title': 'Web Developer'},
 {'Title': 'Graphic Designer (Web/Print)'},
 {'Title': 'Software Developer/ Programmer'},
 {'Title': 'Web Programmer'},
 {'Title': 'Techincal Writer'},
 {'Title': 'Search Engine Optimization Specialists'},
 {'Title': 'Web Developer'},
 {'Title': 'Junior Engineer, Web Systems Department'},
 {'Title': 'Programmer'},
 {'Title': 'Senior ASP/ASP.NET Developer'}]
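The knn clause in the query above has the shape used by the OpenSearch k-NN plugin, which only works when the index was created with k-NN enabled and the vectors field mapped as knn_vector with the embedding dimension. The post does not show that step, so the following is a sketch under that assumption. build_knn_index_body is a hypothetical helper, and 384 is the output dimension of all-MiniLM-L6-v2. Stock Elasticsearch 8.x uses a different field type (dense_vector) and a different kNN search syntax, so this sketch would not apply there.

```python
import json


def build_knn_index_body(dimension=384, field="vectors"):
    """Return an index-creation body for an OpenSearch k-NN index (assumed setup)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


index_body = build_knn_index_body()
print(json.dumps(index_body, indent=2))

# With a live OpenSearch cluster, the index would be created before
# any documents are indexed, for example:
#   es.indices.create(index="posting", body=index_body)
```

If the index is created implicitly by the first es.index call instead, the vectors field is mapped dynamically as ordinary numbers, and a knn query against it is unlikely to work as intended.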

5 comments:

  1. I'm using ElasticSearch 8.3 and I'm getting the following error:
    RequestError: RequestError(400, 'parsing_exception', 'unknown query [knn]')

    Is there anything additional that is needed in my instance of ElasticSearch?

  2. I followed you and your code step by step, and it did not return anything.

