Thursday, July 16, 2020

Machine Learning + Elasticsearch: Get Me Similar Movie Titles

About Myself
  • Hello! I’m Soumil Nitin Shah, a software and hardware developer based in New York City. I completed my Bachelor's in Electronic Engineering and a double Master's in Computer and Electrical Engineering. I develop Python-based cross-platform desktop applications, web pages, software, REST APIs, databases, and more, and I have more than 2 years of experience in Python.

  • Website : http://soumilshah.herokuapp.com/

  • YouTube: https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw

  • I currently work as a Software Engineer at JobTarget.

Step 1:

  • Define the imports
In [1]:
try:
    import elasticsearch
    from elasticsearch import Elasticsearch
    from elasticsearch import helpers

    import pandas as pd
    import json
    from ast import literal_eval
    from tqdm import tqdm
    import datetime
    import os
    import sys

    import tensorflow as tf
    import tensorflow_hub as hub
    import numpy as np

    print("Loaded  .. . . . . . . .")
except Exception as e:
    # The bound name must match the one used in the message below
    print("Some Modules are Missing {} ".format(e))
Loaded  .. . . . . . . .

Step 2:

  • Read the dataset
In [2]:
[x for x in os.listdir()]
Out[2]:
['.ipynb_checkpoints',
 'assets',
 'netflix_titles.csv',
 'saved_model.pb',
 'Untitled.ipynb',
 'variables']
In [3]:
df = pd.read_csv("netflix_titles.csv")
In [5]:
df.head(1)
Out[5]:
  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description
0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra...

We will apply ML to the title column

  • Convert every title into a vector
In [8]:
# Load the ML Model 
embed = hub.KerasLayer(os.getcwd())
In [17]:
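def apply_transform(x):
    """Embed a single title and return the vector as a plain Python list."""
    tem = str(x)                 # cast defensively in case the value is not a string
    x = tf.constant([tem])       # wrap the title in a batch of one
    embeddings = embed(x)
    x = np.asarray(embeddings)
    x = x[0].tolist()            # take the only row of the batch
    return x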
In [18]:
df["ml_vector"] = df["title"].apply(apply_transform)
In [21]:
len(df["ml_vector"].iloc[0])
Out[21]:
20
In [19]:
df.head(2)
Out[19]:
  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | ml_vector
0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... | [-0.6707797646522522, 0.40203648805618286, 0.0...
1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... | [-0.08986619859933853, 0.027812093496322632, 0...

Every title has now been converted into a 20-dimensional vector

Define Elasticsearch Mappings

In [57]:
Settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "ml_vector": {
                "type": "dense_vector",
                "dims": 20
            }
        }
    }
}
In [58]:
ENDPOINT = "http://localhost:9200/"
es = Elasticsearch(timeout=600,hosts=ENDPOINT)
In [59]:
es.ping()
Out[59]:
True
In [70]:
IndexName = 'netflix_ml'
my = es.indices.create(index=IndexName, ignore=[400,404], body=Settings)
In [71]:
my
Out[71]:
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'netflix_ml'}

Transform Data

In [72]:
df.columns
Out[72]:
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'ml_vector'],
      dtype='object')
In [73]:
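def generator(df2):
    # Yield one bulk action per record. The generator ends on its own when
    # the loop finishes; raising StopIteration inside a generator becomes a
    # RuntimeError on Python 3.7+.
    for c, line in enumerate(df2):
        yield {
            '_index': 'netflix_ml',
            '_type': '_doc',
            '_id': c,
            '_source': {
                'title': line.get('title', ""),
                'director': line.get('director', ""),
                'description': line.get('description', ""),
                'ml_vector': line.get('ml_vector', "")
            }
        }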
    

What a Single Record Looks Like

In [74]:
df22 = df.to_dict('records')
next(generator(df22))
Out[74]:
{'_index': 'netflix_ml',
 '_type': '_doc',
 '_id': 0,
 '_source': {'title': 'Norm of the North: King Sized Adventure',
  'director': 'Richard Finn, Tim Maltby',
  'description': 'Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first.',
  'ml_vector': [-0.6707797646522522,
   0.40203648805618286,
   0.09418975561857224,
   -1.3572536706924438,
   0.3880384862422943,
   0.6536661386489868,
   -0.1453024446964264,
   -0.6922371983528137,
   0.46747857332229614,
   0.2289315164089203,
   -1.055234432220459,
   -0.8188006281852722,
   0.8035492897033691,
   -0.45699596405029297,
   0.8889529705047607,
   -0.3411760926246643,
   -1.2927576303482056,
   1.1073145866394043,
   0.20581303536891937,
   0.7649598717689514]}}
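
One thing the sample record does not show: row 1 in df.head(2) above has NaN as its director. Because the key exists in the record, line.get('director', "") returns that NaN rather than the default, and NaN is not valid JSON, so such rows may be rejected at index time. Below is a minimal sketch of one way to handle this, using only the standard library. The helper names clean_value and generate_actions are hypothetical, and the 20 zeros stand in for a real embedding.

```python
import math


def clean_value(value, default=""):
    """Return `default` for None or float NaN, otherwise the value unchanged."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    return value


def generate_actions(records, index_name="netflix_ml"):
    """Yield one bulk action per record, with missing text fields blanked."""
    for doc_id, record in enumerate(records):
        yield {
            "_index": index_name,
            "_type": "_doc",
            "_id": doc_id,
            "_source": {
                "title": clean_value(record.get("title")),
                "director": clean_value(record.get("director")),
                "description": clean_value(record.get("description")),
                "ml_vector": record.get("ml_vector", []),
            },
        }


# Placeholder record shaped like row 1 of the DataFrame (vector is a dummy).
sample = [
    {
        "title": "Jandino: Whatever it Takes",
        "director": float("nan"),
        "description": "Jandino Asporaat riffs on the challenges of ra...",
        "ml_vector": [0.0] * 20,
    }
]
actions = list(generate_actions(sample))
print(actions[0]["_source"]["director"] == "")
```

The same cleaning could also be done up front in pandas before calling to_dict; the per-field helper is used here only to keep the sketch dependency-free.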

Upload

In [75]:
try:
    res = helpers.bulk(es, generator(df22))
    print("Working")
except Exception as e:
    # Report the failure instead of silently swallowing it
    print("Bulk upload failed: {}".format(e))

Test

In [76]:
title = "Krish Trish and Baltiboy: Best Friends Forever"
tem = str(title)
x = tf.constant([tem])
embeddings = embed(x)
x = np.asarray(embeddings)
x = x[0].tolist()
x
Out[76]:
[-3.791809320449829,
 -0.08327436447143555,
 -3.871356964111328,
 -0.8041074275970459,
 -0.6908162832260132,
 0.8903639912605286,
 -3.3496391773223877,
 -3.4853692054748535,
 -3.1927621364593506,
 1.1087660789489746,
 -1.3533275127410889,
 2.607429027557373,
 3.3263041973114014,
 -1.1302865743637085,
 2.3130712509155273,
 -1.810701847076416,
 -5.419280529022217,
 2.3996806144714355,
 4.2233147621154785,
 0.04211375117301941]

Query

In [67]:
Query = {
   "_source":[
      "title"
   ],
   "size":100,
   "query":{
      "script_score":{
         "query":{
            "match":{
               "title":"Krish Trish and Baltiboy: Best Friends Forever"
            }
         },
         "script":{
            "source":"cosineSimilarity(params.query_vector, 'ml_vector') + 1.0",
            "params":{
               "query_vector":[
                  -4.9311370849609375,
                  -0.3049483299255371,
                  -3.552788734436035,
                  -0.737078070640564,
                  1.7232768535614014,
                  0.8952591419219971,
                  -3.95497465133667,
                  -2.081494092941284,
                  -3.7464582920074463,
                  3.05448317527771,
                  2.427945137023926,
                  2.4168589115142822,
                  3.5033276081085205,
                  -2.7748589515686035,
                  4.356207847595215,
                  -2.048246383666992,
                  -4.424686908721924,
                  3.495077610015869,
                  4.518932819366455,
                  -0.9115778207778931
               ]
            }
         }
      }
   }
}
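
The query above hard-codes both the title and its vector. As a rough sketch, the same body can be built from any title and embedding. The function name build_similarity_query is hypothetical, and the vector of 0.1 values is a placeholder for whatever the embedding model actually returns.

```python
def build_similarity_query(title, query_vector, size=100):
    """Build a script_score query body that re-scores title matches by
    cosine similarity against the stored 'ml_vector' field."""
    return {
        "_source": ["title"],
        "size": size,
        "query": {
            "script_score": {
                # Only documents matching this inner query are scored.
                "query": {"match": {"title": title}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'ml_vector') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


# Placeholder vector; in the notebook this would be the model's embedding
# of the title (20 floats, matching "dims": 20 in the mapping).
placeholder_vector = [0.1] * 20
query = build_similarity_query(
    "Krish Trish and Baltiboy: Best Friends Forever", placeholder_vector
)

# With a running cluster and the 7.x client used in this post, the body
# could then be sent with: es.search(index="netflix_ml", body=query)
print(query["size"])
```

Building the body in a function keeps the vector passed to the script in step with the title being searched, which is easy to get wrong when pasting vectors by hand.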

Similar Movie Titles by Cosine Similarity

In [ ]:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 11,
      "relation" : "eq"
    },
    "max_score" : 1.6070579,
    "hits" : [
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "441",
        "_score" : 1.6070579,
        "_source" : {
          "title" : "The Death and Life of Marsha P. Johnson"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "14",
        "_score" : 1.4465705,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: Best Friends Forever"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "341",
        "_score" : 1.4409094,
        "_source" : {
          "title" : "Love and Shukla"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "18",
        "_score" : 1.4052283,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: The Greatest Trick"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "13",
        "_score" : 1.2392325,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: Battle of Wits"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "40",
        "_score" : 1.2374816,
        "_source" : {
          "title" : "Hell and Back"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "15",
        "_score" : 1.1443582,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: Comics of India"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "86",
        "_score" : 1.1186142,
        "_source" : {
          "title" : "Cultivating the Seas: History and Future of the Full-Cycle Cultured Kindai Tuna"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "189",
        "_score" : 1.09444,
        "_source" : {
          "title" : "Come and Find Me"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "306",
        "_score" : 1.0660063,
        "_source" : {
          "title" : "Just Friends"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "16",
        "_score" : 0.9835859,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: Oversmartness Never Pays"
        }
      }
    ]
  }
}
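
To read the _score values: the script adds 1.0 to the cosine similarity, so scores fall between 0 and 2, with 2.0 meaning the vectors point in the same direction. The sketch below reproduces that arithmetic in plain Python on toy 2-dimensional vectors; cosine_score is a hypothetical helper, not part of Elasticsearch.

```python
import math


def cosine_score(a, b):
    """Cosine similarity of two equal-length vectors, shifted by +1.0
    the same way the script in the query does."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) + 1.0


same = cosine_score([1.0, 0.0], [1.0, 0.0])        # same direction
orthogonal = cosine_score([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = cosine_score([1.0, 0.0], [-1.0, 0.0])   # opposite direction

print(same, orthogonal, opposite)  # 2.0 1.0 0.0
```

Two things follow for the response above. Only 11 documents come back because the inner match query limits scoring to titles that share a word with the search text (every hit contains "and", "Friends", or one of the "Krish Trish and Baltiboy" tokens). And the exact title scores 1.4465705 rather than 2.0, which suggests the hard-coded query_vector is not the same vector that was indexed for that title.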
