Soumil Nitin Shah

Bachelor in Electronic Engineering | Masters in Electrical Engineering | Master in Computer Engineering

Website : https://soumilshah.herokuapp.com
Github: https://github.com/soumilshah1995
Linkedin: https://www.linkedin.com/in/shah-soumil/
Blog: https://soumilshah1995.blogspot.com/
Youtube : https://www.youtube.com/channel/UC_eOodxvwS_H7x2uLQa-svw?view_as=subscriber
Facebook Page : https://www.facebook.com/soumilshah1995/
Email : shahsoumil519@gmail.com

Question:

Your company has given you two CSV files with more than 10,000 email addresses. They want you to send back one file containing the email addresses common to both files, and another file containing the email addresses that are unique between the two files. You cannot use Excel; use Python to solve this challenge.

df1.head(4)

df2.head(4)

Algorithm and Steps

Convert the email columns into lists

l1 = df1["Emails"].to_list()
l2 = df2["Emails"].to_list()

len(l1)

10000

len(l2)

10000

Create a list of duplicate email addresses along with their indexes

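import collections

# Each match records the position in l1 and the matching address.
Duplicate = collections.namedtuple("Emails", "index email")
tem = []

# Compare every address in l1 with every address in l2.
for i, email in enumerate(l1):
    for other in l2:
        if email == other:
            tem.append(Duplicate(i, email))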

len(tem)

127
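
The nested loops above compare every address in l1 with every address in l2, which is about 100 million comparisons for two lists of 10,000. As a minimal sketch of an alternative (not the approach used in this post), a set lookup finds the same matching addresses much faster. The sample lists below are hypothetical stand-ins for the real columns. Note that this version records each l1 index at most once, so its count can be lower than the 127 reported above when an address occurs more than once in l2.

```python
# Hypothetical sample data standing in for df1["Emails"] and df2["Emails"].
l1 = ["a@example.com", "b@example.com", "c@example.com", "b@example.com"]
l2 = ["b@example.com", "d@example.com", "b@example.com"]

# Build the set once so that each membership test is a hash lookup.
emails_in_l2 = set(l2)

# Record (index, email) for every entry of l1 that also appears in l2.
common = [(i, email) for i, email in enumerate(l1) if email in emails_in_l2]

print(common)
```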

Create a list with just the indexes where these duplicate values are

# Collect only the l1 positions from the recorded matches.
Index_remove = [x.index for x in tem]

These are the indexes where duplicate emails are found

Index_remove[0:12]

[17, 169, 215, 263, 263, 270, 271, 283, 414, 434, 435, 442]
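
Index 263 appears twice in this output. With the nested loops used earlier, that happens when the address at that position in l1 matches more than one row of l2. The sketch below reproduces the effect on hypothetical sample data and shows one possible way to keep each index once while preserving order; the name unique_indices is an assumption and not part of the original notebook.

```python
import collections

Duplicate = collections.namedtuple("Duplicate", "index email")

# Hypothetical data: "b@example.com" occurs twice in l2.
l1 = ["a@example.com", "b@example.com", "c@example.com"]
l2 = ["b@example.com", "x@example.com", "b@example.com", "c@example.com"]

# Same pairwise comparison as the nested loops: one entry per matching pair.
tem = [Duplicate(i, e1) for i, e1 in enumerate(l1) for e2 in l2 if e1 == e2]
Index_remove = [d.index for d in tem]
print(Index_remove)    # [1, 1, 2]

# dict.fromkeys keeps the first occurrence of each index, in order.
unique_indices = list(dict.fromkeys(Index_remove))
print(unique_indices)  # [1, 2]
```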

Iterate over the emails

Check whether each email's index exists in Index_remove. If it does, skip it; otherwise append the email to a new list.
The appended values are your unique emails.

newList = []
for c, x1 in enumerate(l1):
    # Keep only the emails whose index was not flagged as a duplicate.
    if c not in Index_remove:
        newList.append(x1)

Create a pandas DataFrame

myuni = pd.DataFrame(data={
    "UniqueEmail":newList
})

myuni
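
The question asks for two output files, one with the common addresses and one with the unique ones. As a rough sketch under stated assumptions, set operations can produce both lists, and the standard csv module can write them out. The sample data, the helper write_emails, and the header names are all hypothetical, and in-memory buffers stand in for real files so the example runs anywhere. The sketch shows two readings of "unique": addresses only in the first file (what the steps above compute) and addresses that appear in exactly one of the two files.

```python
import csv
import io

# Hypothetical sample data standing in for the two email columns.
l1 = ["a@example.com", "b@example.com", "c@example.com"]
l2 = ["b@example.com", "d@example.com"]

set1, set2 = set(l1), set(l2)
common = sorted(set1 & set2)            # present in both files
only_in_first = sorted(set1 - set2)     # in the first file only
unique_to_either = sorted(set1 ^ set2)  # in exactly one of the files


def write_emails(handle, header, emails):
    """Write a one-column CSV with the given header to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow([header])
    writer.writerows([email] for email in emails)


# StringIO buffers are used here; a real file would be opened with
# open(path, "w", newline="") and passed in the same way.
common_buf = io.StringIO()
unique_buf = io.StringIO()
write_emails(common_buf, "CommonEmail", common)
write_emails(unique_buf, "UniqueEmail", unique_to_either)

print(common_buf.getvalue())
print(unique_buf.getvalue())
```

With pandas, the DataFrame built above could be saved in a similar way through its to_csv method, passing index=False to leave out the row numbers.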

Output of df1.head(4):

	Emails
0	linjason@gmail.com
1	howellian@heath.biz
2	angelaharris@sloan-marshall.net
3	ngilbert@jones.biz

Output of df2.head(4):

	Emails
0	osimmons@yahoo.com
1	martinsimon@gmail.com
2	yferrell@smith.info
3	brittany18@hotmail.com

Output of myuni:

	UniqueEmail
0	valenciaariel@shelton-peterson.com
1	mark96@andersen.com
2	warrenjames@garrison.info
3	hmitchell@yahoo.com
4	joshuadickerson@gmail.com
...	...
9875	uodom@yahoo.com
9876	jasonprice@gmail.com
9877	jeremy37@hotmail.com
9878	ojohnson@hotmail.com
9879	phillip45@bennett-romero.com

Pythonist

Saturday, March 21, 2020

How to remove Duplicate Email address from Two Csv File using Python | Algorithm
