Sunday, April 4, 2021

Simple Machine Learning Model to Predict News Category | Multi-Class Classification

Spacy Multi-Class Classification

Multi-Class Text Classification using NLP :D

Soumil Nitin Shah

Bachelor in Electronic Engineering | Masters in Electrical Engineering | Master in Computer Engineering |

Excellent experience of building scalable and high-performance Software Applications combining distinctive skill sets in Internet of Things (IoT), Machine Learning and Full Stack Web Development in Python.

In [1]:
try:
    # Standard library
    import json
    import os
    import re
    import string

    # Data handling and plotting
    import pandas as pd
    import seaborn as sns
    import swifter
    from tqdm import tqdm

    # NLP
    import nltk
    import spacy
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from textblob import TextBlob

    # scikit-learn
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import FeatureUnion
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.preprocessing import LabelEncoder

    tqdm.pandas()
except Exception as e:
    print("Error : {} ".format(e))
In [459]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\s.shah\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\s.shah\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\s.shah\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\s.shah\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
Out[459]:
True
In [2]:
df = pd.read_json("News_Category_Dataset_v2.json", lines=True)
In [3]:
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x2f7e7ed3f70>
In [4]:
df['category'].value_counts().plot( kind='bar', figsize=(15,10))
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x2f7ecaede80>
In [5]:
#df = df.head(6000)
In [6]:
df.columns
Out[6]:
Index(['category', 'headline', 'authors', 'link', 'short_description', 'date'], dtype='object')
In [7]:
df.describe()
Out[7]:
        category  headline        authors  link                                               short_description  date
count   200853    200853          200853   200853                                             200853             200853
unique  41        199344          27993    200812                                             178353             2309
top     POLITICS  Sunday Roundup           https://www.huffingtonpost.comhttp://testkitch...                     2013-01-17 00:00:00
freq    32739     90              36620    2                                                  19712              100
first   NaN       NaN             NaN      NaN                                                NaN                2012-01-28 00:00:00
last    NaN       NaN             NaN      NaN                                                NaN                2018-05-26 00:00:00
In [8]:
df.isna().sum()
Out[8]:
category             0
headline             0
authors              0
link                 0
short_description    0
date                 0
dtype: int64
In [9]:
df.head(2)
Out[9]:
category headline authors link short_description date
0 CRIME There Were 2 Mass Shootings In Texas Last Week... Melissa Jeltsen https://www.huffingtonpost.com/entry/texas-ama... She left her husband. He killed their children... 2018-05-26
1 ENTERTAINMENT Will Smith Joins Diplo And Nicky Jam For The 2... Andy McDonald https://www.huffingtonpost.com/entry/will-smit... Of course it has a song. 2018-05-26
In [10]:
df['category'].unique()
Out[10]:
array(['CRIME', 'ENTERTAINMENT', 'WORLD NEWS', 'IMPACT', 'POLITICS',
       'WEIRD NEWS', 'BLACK VOICES', 'WOMEN', 'COMEDY', 'QUEER VOICES',
       'SPORTS', 'BUSINESS', 'TRAVEL', 'MEDIA', 'TECH', 'RELIGION',
       'SCIENCE', 'LATINO VOICES', 'EDUCATION', 'COLLEGE', 'PARENTS',
       'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE', 'HEALTHY LIVING',
       'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST', 'FIFTY', 'ARTS',
       'WELLNESS', 'PARENTING', 'HOME & LIVING', 'STYLE & BEAUTY',
       'DIVORCE', 'WEDDINGS', 'FOOD & DRINK', 'MONEY', 'ENVIRONMENT',
       'CULTURE & ARTS'], dtype=object)


Preprocessing
  • Remove punctuation from the text.
  • Convert the text to lowercase. Both steps assume that punctuation and letter case do not change the meaning of words.
  • Lemmatize each word.
In [11]:
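# This snippet of code is from
# https://github.com/ArmandDS/news_category/blob/master/News_Analysis_AO.ipynb

stop_words_ = set(stopwords.words('english'))
wn = WordNetLemmatizer()
# Extra corpus-specific words to drop in addition to the NLTK stop words
my_sw = ['make', 'amp', 'news', 'new', 'time', 'u', 's', 'photos', 'get', 'say']

def black_txt(token):
    # Keep a token only if it is not a stop word, not punctuation, and longer than two characters
    return token not in stop_words_ and token not in list(string.punctuation) and len(token) > 2 and token not in my_sw

def clean_txt(text):
    text = re.sub("'", "", text)              # drop apostrophes
    text = re.sub("(\\d|\\W)+", " ", text)    # replace runs of digits and non-word characters with a space
    # Lowercase, tokenize, filter, and lemmatize each token as a verb
    clean_text = [wn.lemmatize(word, pos="v") for word in word_tokenize(text.lower()) if black_txt(word)]
    # Filter again, because lemmatization can turn a token into a stop word
    clean_text2 = [word for word in clean_text if black_txt(word)]
    return " ".join(clean_text2)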
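To see what the two regex substitutions in clean_txt do on their own, here is a minimal, dependency-free sketch. The helper `normalize` is hypothetical and not part of the notebook: it uses str.split in place of word_tokenize, keeps only the length check from black_txt, and skips stop-word removal and lemmatization.

```python
import re

def normalize(text):
    """Hypothetical helper: mirrors only the regex, lowercase, and length steps."""
    text = re.sub("'", "", text)            # drop straight apostrophes: "It's" -> "Its"
    text = re.sub(r"(\d|\W)+", " ", text)   # collapse runs of digits/non-word characters into one space
    # Keep tokens longer than two characters, as black_txt does.
    return " ".join(tok for tok in text.lower().split() if len(tok) > 2)

print(normalize("It's 2018: Will Smith's new song!"))
print(normalize("Floyd’s Death"))
```

Note that the first substitution only matches the straight apostrophe. A curly apostrophe, as in "Floyd’s", is handled by the second substitution instead, so the word splits into "floyd" and "s", and the length check then drops the "s".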
In [12]:
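def subj_txt(text):
    # TextBlob sentiment is (polarity, subjectivity); index 1 is subjectivity
    return TextBlob(text).sentiment[1]

def polarity_txt(text):
    # Index 0 is polarity
    return TextBlob(text).sentiment[0]

def len_text(text):
    # Ratio of unique cleaned words to the total number of words
    if len(text.split()) > 0:
        return len(set(clean_txt(text).split())) / len(text.split())
    else:
        return 0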
In [13]:
df['text'] = df['headline']  +  " " + df['short_description']

df['text'] = df['text'].swifter.apply(clean_txt)
df['polarity'] = df['text'].swifter.apply(polarity_txt)
df['subjectivity'] = df['text'].swifter.apply(subj_txt)
df['len'] = df['text'].swifter.apply(lambda x: len(x))




Label Encoding
In [15]:
X = df[['text', 'polarity', 'subjectivity', 'len']]
y = df['category']

# Map each category name to an integer id
encoder = LabelEncoder()
y = encoder.fit_transform(y)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
# Lookup from integer id back to category name
v = dict(enumerate(encoder.classes_))
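As a rough illustration of what the label-encoding step produces, the sketch below builds the same kind of mapping in plain Python. LabelEncoder assigns integer ids to the sorted unique labels; `fit_label_mapping` is a hypothetical stand-in written for this example, and the four sample labels are made up.

```python
def fit_label_mapping(labels):
    """Hypothetical stand-in for LabelEncoder: sorted unique labels -> 0..n-1."""
    classes = sorted(set(labels))
    to_id = {label: i for i, label in enumerate(classes)}
    to_label = dict(enumerate(classes))  # plays the role of the v dictionary
    return to_id, to_label

labels = ["POLITICS", "CRIME", "SPORTS", "CRIME"]
to_id, to_label = fit_label_mapping(labels)

encoded = [to_id[label] for label in labels]
print(encoded)                          # [1, 0, 2, 0]
print([to_label[i] for i in encoded])   # ['POLITICS', 'CRIME', 'SPORTS', 'CRIME']
```

The reverse lookup is what turns the model's integer predictions back into category names later on.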
Training the Model
In [17]:
text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer="word", stop_words="english")),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', MultinomialNB(alpha=.01)),
])
In [18]:
text_clf.fit(x_train['text'].to_list(), list(y_train))
Out[18]:
Pipeline(steps=[('vect', CountVectorizer(stop_words='english')),
                ('tfidf', TfidfTransformer()),
                ('clf', MultinomialNB(alpha=0.01))])
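To give a feel for what the 'vect' and 'tfidf' steps feed into the classifier, here is a small pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is my understanding of the TfidfTransformer defaults. The three documents and the names `doc_freq` and `tfidf` are made up for the example.

```python
import math
from collections import Counter

# Toy corpus (hypothetical), already cleaned and lowercased.
docs = [
    "senate vote budget",
    "team win game",
    "senate budget debate",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for toks in tokenized for term in set(toks))

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights using the smoothed IDF assumed above."""
    counts = Counter(tokens)
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

vec = tfidf(tokenized[0])
print(vec)
```

In the first document, "vote" appears in only one document of the corpus, so it gets a larger weight than "senate" and "budget", which each appear in two.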

Testing Model

In [554]:
import numpy as np
In [555]:
X_TEST = x_test['text'].to_list()
Y_TEST = list(y_test)
In [556]:
predicted = text_clf.predict(X_TEST)
In [557]:
# Print the first two test documents with their predicted categories
for doc, category in zip(X_TEST[:2], predicted[:2]):
    print("-"*55)
    print(doc)
    print(v[category])
    print("-"*55)
-------------------------------------------------------
twiggy model leather collection prove hasnt lose edge twiggy check model piece collection available shop thursday
STYLE & BEAUTY
-------------------------------------------------------
-------------------------------------------------------
cities rally around paris deal reminder global problems local solutions cla europe lead germany france also step fray planet great french president emmanuel
THE WORLDPOST
-------------------------------------------------------

Accuracy

In [558]:
np.mean(predicted == Y_TEST)
Out[558]:
0.553035772074382
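The figure above is the fraction of test headlines whose predicted category matches the true one. With 41 categories of very different sizes (POLITICS alone is about 16% of the rows), a single accuracy number can hide weak classes, so a per-class view may be worth a look. The following is a minimal sketch in plain Python; the helper names and the five sample labels are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true label, the share of its examples predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}

# Made-up labels: the model over-predicts the majority class.
y_true = ["POLITICS", "POLITICS", "POLITICS", "TECH", "TECH"]
y_pred = ["POLITICS", "POLITICS", "POLITICS", "POLITICS", "TECH"]

print(accuracy(y_true, y_pred))          # 0.8
print(per_class_recall(y_true, y_pred))  # {'POLITICS': 1.0, 'TECH': 0.5}
```

Here the overall accuracy is 0.8 even though half of the TECH examples are misclassified.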

Prediction

In [544]:
# Note: this headline is passed in raw, while the training text went through clean_txt
docs_new = ['Ten Months After George Floyd’s Death, Minneapolis Residents Are at War Over Policing']
In [545]:
predicted = text_clf.predict(docs_new)
In [546]:
v[predicted[0]]
Out[546]:
'POLITICS'
Saving the Model
In [547]:
import pickle
with open('model.pkl','wb') as f:
    pickle.dump(text_clf,f)
In [548]:
# load
with open('model.pkl', 'rb') as f:
    clf2 = pickle.load(f)
In [549]:
docs_new = ['Ten Months After George Floyd’s Death, Minneapolis Residents Are at War Over Policing']
predicted = clf2.predict(docs_new)
In [550]:
v[predicted[0]]
Out[550]:
'POLITICS'
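The pickle file above holds only the pipeline, so the `v` lookup still has to exist in the session that loads it. One possible approach, sketched here with stand-in objects, is to pickle the model and the label mapping together. The names `bundle`, `id_to_label`, and `model_bundle.pkl` are hypothetical, and the dictionary used as `model` only stands in for the fitted pipeline.

```python
import os
import pickle
import tempfile

# Hypothetical stand-ins: in the notebook these would be text_clf and v.
model = {"name": "stand-in for the fitted pipeline"}
id_to_label = {0: "CRIME", 1: "POLITICS"}

# Store both objects in one dictionary so they travel together.
bundle = {"model": model, "labels": id_to_label}

path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Later, possibly in a fresh session: load once and get both pieces back.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored["labels"][1])  # POLITICS
```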

