Thursday, July 16, 2020

Machine Learning + Elasticsearch: Find Similar Movie Titles


About Myself
  • Hello! I’m Soumil Nitin Shah, a software and hardware developer based in New York City. I completed my Bachelor's in Electronic Engineering and a double Master's in Computer and Electrical Engineering. I develop Python-based cross-platform desktop applications, web pages, software, REST APIs, databases, and more. I have more than 2 years of experience in Python.

  • Website: http://soumilshah.herokuapp.com/

  • YouTube: https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw

  • I currently work as a Software Engineer at JobTarget.

Step 1:

  • Define imports
In [1]:
try:
    # Elasticsearch client and bulk helpers
    import elasticsearch
    from elasticsearch import Elasticsearch
    from elasticsearch import helpers

    # Data handling and utilities
    import pandas as pd
    import json
    from ast import literal_eval
    from tqdm import tqdm
    import datetime
    import os
    import sys

    # Model loading and numeric arrays
    import tensorflow as tf
    import tensorflow_hub as hub
    import numpy as np

    print("Loaded  .. . . . . . . .")
except Exception as e:
    # Name the exception "e" so the message below can reference it
    print("Some Modules are Missing {} ".format(e))
Loaded  .. . . . . . . .

Step 2:

  • Read the dataset
In [2]:
[x for x in os.listdir()]
Out[2]:
['.ipynb_checkpoints',
 'assets',
 'netflix_titles.csv',
 'saved_model.pb',
 'Untitled.ipynb',
 'variables']
In [3]:
df = pd.read_csv("netflix_titles.csv")
In [5]:
df.head(1)
Out[5]:
  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description
0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra...

We will apply the ML model to the title column

  • Convert every title into a vector
In [8]:
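# Load the saved model from the current working directory
# (the folder listed above contains saved_model.pb, variables and assets)
embed = hub.KerasLayer(os.getcwd())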
In [17]:
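def apply_transform(x):
    """Return the model's embedding of one title as a plain list of floats."""
    tem = str(x)                      # make sure the input is a string
    batch = tf.constant([tem])        # the model expects a batch, here of size one
    embeddings = embed(batch)
    vector = np.asarray(embeddings)   # shape: (1, embedding_length)
    return vector[0].tolist()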
In [18]:
df["ml_vector"] = df["title"].apply(apply_transform)
In [21]:
len(df["ml_vector"].iloc[0])
Out[21]:
20
In [19]:
df.head(2)
Out[19]:
  | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description | ml_vector
0 | 81145628 | Movie | Norm of the North: King Sized Adventure | Richard Finn, Tim Maltby | Alan Marriott, Andrew Toth, Brian Dobson, Cole... | United States, India, South Korea, China | September 9, 2019 | 2019 | TV-PG | 90 min | Children & Family Movies, Comedies | Before planning an awesome wedding for his gra... | [-0.6707797646522522, 0.40203648805618286, 0.0...
1 | 80117401 | Movie | Jandino: Whatever it Takes | NaN | Jandino Asporaat | United Kingdom | September 9, 2016 | 2016 | TV-MA | 94 min | Stand-Up Comedy | Jandino Asporaat riffs on the challenges of ra... | [-0.08986619859933853, 0.027812093496322632, 0...

Every title is now converted into a vector
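Calling the model once per row works, but it can be slow on larger frames. Below is a minimal sketch of batching, assuming the loaded model accepts several strings per call (not confirmed for this particular saved model). The names `embed_in_batches` and `fake_embed` are hypothetical; `fake_embed` only stands in for the real `embed` layer so the snippet runs without TensorFlow.

```python
from typing import Callable, List, Sequence


def embed_in_batches(
    titles: Sequence[str],
    embed_fn: Callable[[List[str]], Sequence[Sequence[float]]],
    batch_size: int = 256,
) -> List[List[float]]:
    """Embed titles batch by batch and return one plain list of floats per title."""
    vectors: List[List[float]] = []
    for start in range(0, len(titles), batch_size):
        # Convert every entry to str, mirroring str(x) in apply_transform
        batch = [str(t) for t in titles[start:start + batch_size]]
        for row in embed_fn(batch):
            vectors.append([float(v) for v in row])
    return vectors


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for the model: one 20-dimensional vector per string."""
    return [[float(len(text))] * 20 for text in batch]


vectors = embed_in_batches(["Hell and Back", "Just Friends"], fake_embed, batch_size=1)
print(len(vectors), len(vectors[0]))
```

With the real model, `embed_fn` could be something like `lambda batch: np.asarray(embed(tf.constant(batch)))`, applied to `df["title"].tolist()`.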

Define Elasticsearch Mappings

In [57]:
Settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "ml_vector": {
                "type": "dense_vector",
                "dims": 20
            }
        }
    }
}
In [58]:
ENDPOINT = "http://localhost:9200/"
es = Elasticsearch(timeout=600,hosts=ENDPOINT)
In [59]:
es.ping()
Out[59]:
True
In [70]:
IndexName = 'netflix_ml'
my = es.indices.create(index=IndexName, ignore=[400,404], body=Settings)
In [71]:
my
Out[71]:
{'acknowledged': True, 'shards_acknowledged': True, 'index': 'netflix_ml'}
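The `dims` value in the mapping has to agree with the length of the vectors produced earlier (20 here), since a `dense_vector` field is expected to reject documents whose vector length differs. The sketch below builds the same index body from a dimension count and checks a list of vectors against it before anything is sent to the cluster. `build_index_body` and `check_dims` are hypothetical helper names, not part of the Elasticsearch client.

```python
from typing import Dict, List


def build_index_body(dims: int) -> Dict:
    """Return index settings and a dense_vector mapping for vectors of length dims."""
    return {
        "settings": {"number_of_shards": 1, "number_of_replicas": 0},
        "mappings": {
            "properties": {
                "ml_vector": {"type": "dense_vector", "dims": dims}
            }
        },
    }


def check_dims(vectors: List[List[float]], dims: int) -> None:
    """Raise ValueError for the first vector whose length is not dims."""
    for position, vector in enumerate(vectors):
        if len(vector) != dims:
            raise ValueError(
                "vector {} has length {}, expected {}".format(position, len(vector), dims)
            )


sample_vectors = [[0.0] * 20, [1.0] * 20]  # made-up vectors for illustration
check_dims(sample_vectors, 20)
body = build_index_body(len(sample_vectors[0]))
print(body["mappings"]["properties"]["ml_vector"])
```

In the notebook, the dimension could be taken from `len(df["ml_vector"].iloc[0])` instead of being typed by hand.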

Transform Data

In [72]:
df.columns
Out[72]:
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'ml_vector'],
      dtype='object')
In [73]:
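def generator(df2):
    # Yield one bulk action per record; the row position becomes the document id
    for c, line in enumerate(df2):
        yield {
            '_index': 'netflix_ml',
            '_type': '_doc',
            '_id': c,
            '_source': {
                'title': line.get('title', ""),
                'director': line.get('director', ""),
                'description': line.get('description', ""),
                'ml_vector': line.get('ml_vector', "")
            }
        }
    # The generator ends by returning; raising StopIteration here
    # would turn into a RuntimeError on Python 3.7+ (PEP 479)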
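One detail worth watching: `df.to_dict('records')` keeps missing values as float NaN (the director of the second row, for example), and `line.get('director', "")` returns that NaN unchanged because the key is present. A minimal sketch of a variant that swaps NaN for an empty string before indexing follows. `clean` and `safe_generator` are hypothetical names, and the sample records are made up for illustration.

```python
import math
from typing import Any, Dict, Iterable, Iterator


def clean(value: Any, default: str = "") -> Any:
    """Replace None or a float NaN (pandas' missing-value marker) with default."""
    if value is None:
        return default
    if isinstance(value, float) and math.isnan(value):
        return default
    return value


def safe_generator(records: Iterable[Dict[str, Any]],
                   index_name: str = "netflix_ml") -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per record with missing text fields blanked out."""
    for doc_id, record in enumerate(records):
        yield {
            "_index": index_name,
            "_type": "_doc",
            "_id": doc_id,
            "_source": {
                "title": clean(record.get("title")),
                "director": clean(record.get("director")),
                "description": clean(record.get("description")),
                "ml_vector": record.get("ml_vector"),
            },
        }


sample_records = [
    {"title": "Jandino: Whatever it Takes", "director": float("nan"),
     "description": "Stand-up special", "ml_vector": [0.0] * 20},
]
actions = list(safe_generator(sample_records))
print(actions[0]["_source"]["director"] == "")
```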
    

What a Single Record Looks Like

In [74]:
df22 = df.to_dict('records')
next(generator(df22))
Out[74]:
{'_index': 'netflix_ml',
 '_type': '_doc',
 '_id': 0,
 '_source': {'title': 'Norm of the North: King Sized Adventure',
  'director': 'Richard Finn, Tim Maltby',
  'description': 'Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first.',
  'ml_vector': [-0.6707797646522522,
   0.40203648805618286,
   0.09418975561857224,
   -1.3572536706924438,
   0.3880384862422943,
   0.6536661386489868,
   -0.1453024446964264,
   -0.6922371983528137,
   0.46747857332229614,
   0.2289315164089203,
   -1.055234432220459,
   -0.8188006281852722,
   0.8035492897033691,
   -0.45699596405029297,
   0.8889529705047607,
   -0.3411760926246643,
   -1.2927576303482056,
   1.1073145866394043,
   0.20581303536891937,
   0.7649598717689514]}}

Upload

In [75]:
try:
    res = helpers.bulk(es, generator(df22))
    print("Working")
except Exception as e:
    # Report the failure instead of silently ignoring it
    print("Bulk upload failed: {}".format(e))
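To see how many documents actually made it into the index, the return value of the bulk call can be inspected. `helpers.bulk` returns a pair of (number of successes, list of errors), and passing `raise_on_error=False` collects failures instead of raising on the first one. The sketch below wraps that in a small reporting function. `upload_with_report` is a hypothetical name, and `fake_bulk` is only a stand-in so the snippet runs without a cluster.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def upload_with_report(
    bulk_fn: Callable[..., Tuple[int, List[Any]]],
    client: Any,
    actions: Iterable[Dict[str, Any]],
) -> Tuple[int, List[Any]]:
    """Run a bulk upload and print how many documents succeeded and failed."""
    success, errors = bulk_fn(client, actions, raise_on_error=False)
    print("Indexed {} documents, {} failed".format(success, len(errors)))
    for error in errors[:5]:
        # Show only the first few failures to keep the output readable
        print("Failed item:", error)
    return success, errors


def fake_bulk(client: Any, actions: Iterable[Dict[str, Any]],
              raise_on_error: bool = True) -> Tuple[int, List[Any]]:
    """Hypothetical stand-in for helpers.bulk: counts actions, reports no errors."""
    return sum(1 for _ in actions), []


success, errors = upload_with_report(fake_bulk, None, iter([{"_id": 0}, {"_id": 1}]))
```

Against a real cluster the call would be `upload_with_report(helpers.bulk, es, generator(df22))`, and the success count can be compared with `len(df22)`.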

Test

In [76]:
title = "Krish Trish and Baltiboy: Best Friends Forever"
tem = str(title)          # embed the title defined above
x = tf.constant([tem])
embeddings = embed(x)
x = np.asarray(embeddings)
x = x[0].tolist()
x
Out[76]:
[-3.791809320449829,
 -0.08327436447143555,
 -3.871356964111328,
 -0.8041074275970459,
 -0.6908162832260132,
 0.8903639912605286,
 -3.3496391773223877,
 -3.4853692054748535,
 -3.1927621364593506,
 1.1087660789489746,
 -1.3533275127410889,
 2.607429027557373,
 3.3263041973114014,
 -1.1302865743637085,
 2.3130712509155273,
 -1.810701847076416,
 -5.419280529022217,
 2.3996806144714355,
 4.2233147621154785,
 0.04211375117301941]

Query

In [67]:
Query = {
   "_source":[
      "title"
   ],
   "size":100,
   "query":{
      "script_score":{
         "query":{
            "match":{
               "title":"Krish Trish and Baltiboy: Best Friends Forever"
            }
         },
         "script":{
            "source":"cosineSimilarity(params.query_vector, 'ml_vector') + 1.0",
            "params":{
               "query_vector":[
                  -4.9311370849609375,
                  -0.3049483299255371,
                  -3.552788734436035,
                  -0.737078070640564,
                  1.7232768535614014,
                  0.8952591419219971,
                  -3.95497465133667,
                  -2.081494092941284,
                  -3.7464582920074463,
                  3.05448317527771,
                  2.427945137023926,
                  2.4168589115142822,
                  3.5033276081085205,
                  -2.7748589515686035,
                  4.356207847595215,
                  -2.048246383666992,
                  -4.424686908721924,
                  3.495077610015869,
                  4.518932819366455,
                  -0.9115778207778931
               ]
            }
         }
      }
   }
}
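Rather than pasting the 20 numbers by hand, the same request body can be assembled from a title and its vector. This is a minimal sketch; `build_similarity_query` is a hypothetical helper, and the vector in the example is made up rather than produced by the model.

```python
from typing import Dict, List


def build_similarity_query(
    title_text: str,
    query_vector: List[float],
    size: int = 100,
    vector_field: str = "ml_vector",
) -> Dict:
    """Build a script_score query that ranks title matches by cosine similarity."""
    return {
        "_source": ["title"],
        "size": size,
        "query": {
            "script_score": {
                # Only documents matching this inner query are scored
                "query": {"match": {"title": title_text}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, '{}') + 1.0".format(
                        vector_field
                    ),
                    "params": {"query_vector": query_vector},
                },
            }
        },
    }


example_vector = [0.5] * 20  # made-up vector for illustration
query = build_similarity_query(
    "Krish Trish and Baltiboy: Best Friends Forever", example_vector, size=10
)
print(query["query"]["script_score"]["script"]["source"])
```

In the notebook, the vector would come from `apply_transform(title)`, and with the 7.x Python client the body could presumably be sent with `es.search(index=IndexName, body=query)`. Because of the inner `match` clause, only titles sharing at least one word with the query text are scored; replacing it with `{"match_all": {}}` would score every document.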

Similar Movie Titles Ranked by Cosine Similarity

In [ ]:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 11,
      "relation" : "eq"
    },
    "max_score" : 1.6070579,
    "hits" : [
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "441",
        "_score" : 1.6070579,
        "_source" : {
          "title" : "The Death and Life of Marsha P. Johnson"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "14",
        "_score" : 1.4465705,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: Best Friends Forever"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "341",
        "_score" : 1.4409094,
        "_source" : {
          "title" : "Love and Shukla"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "18",
        "_score" : 1.4052283,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: The Greatest Trick"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "13",
        "_score" : 1.2392325,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: Battle of Wits"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "40",
        "_score" : 1.2374816,
        "_source" : {
          "title" : "Hell and Back"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "15",
        "_score" : 1.1443582,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: Comics of India"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "86",
        "_score" : 1.1186142,
        "_source" : {
          "title" : "Cultivating the Seas: History and Future of the Full-Cycle Cultured Kindai Tuna"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "189",
        "_score" : 1.09444,
        "_source" : {
          "title" : "Come and Find Me"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "306",
        "_score" : 1.0660063,
        "_source" : {
          "title" : "Just Friends"
        }
      },
      {
        "_index" : "netflix_ml",
        "_type" : "_doc",
        "_id" : "16",
        "_score" : 0.9835859,
        "_source" : {
          "title" : "Krish Trish and Baltiboy: Oversmartness Never Pays"
        }
      }
    ]
  }
}
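Each `_score` above is the value returned by the script: the cosine similarity between the query vector and the stored `ml_vector`, plus 1.0. Cosine similarity lies between -1 and 1, so scores fall between 0 and 2, and a stored vector pointing in the same direction as the query vector scores 2.0. The plain-Python sketch below reproduces that arithmetic on small made-up vectors; the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def script_score(query_vector: Sequence[float], doc_vector: Sequence[float]) -> float:
    """Mirror the script used in the query: cosine similarity shifted by 1.0."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


same_direction = script_score([3.0, 4.0], [6.0, 8.0])   # close to 2.0
orthogonal = script_score([1.0, 0.0], [0.0, 1.0])       # close to 1.0
opposite = script_score([1.0, 0.0], [-1.0, 0.0])        # close to 0.0
print(same_direction, orthogonal, opposite)
```

The shift by 1.0 keeps every score non-negative, which Elasticsearch requires of script scores.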
