Saturday, March 21, 2020

How to Remove Duplicate Email Addresses from Two CSV Files Using Python | Algorithm

EmailsMaster

Soumil Nitin Shah

Bachelor in Electronic Engineering | Masters in Electrical Engineering | Master in Computer Engineering

Question:

Your company gave you two CSV files with more than 10,000 email addresses. They want you to send back one file containing the email addresses common to both files, and another file containing the email addresses that are unique between the two CSV files. You cannot use Excel; use Python to solve this challenge.

In [10]:
df1.head(4)
Out[10]:
Emails
0 linjason@gmail.com
1 howellian@heath.biz
2 angelaharris@sloan-marshall.net
3 ngilbert@jones.biz
In [13]:
df2.head(4)
Out[13]:
Emails
0 osimmons@yahoo.com
1 martinsimon@gmail.com
2 yferrell@smith.info
3 brittany18@hotmail.com
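
The notebook does not show how df1 and df2 were created. A minimal sketch of one way to load them, assuming each CSV has a single column named Emails; the helper name load_emails, the file names, and the whitespace clean-up are assumptions, not part of the original notebook:

```python
import pandas as pd

def load_emails(path):
    """Read a CSV with an 'Emails' column and return it as a DataFrame."""
    df = pd.read_csv(path)
    # Assumed clean-up: strip stray spaces so identical addresses compare equal
    df["Emails"] = df["Emails"].str.strip()
    return df

# Hypothetical file names; replace with the real paths
# df1 = load_emails("emails_1.csv")
# df2 = load_emails("emails_2.csv")
```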

Algorithms and Steps

  • Convert the email columns into lists
In [35]:
l1 = df1["Emails"].to_list()
l2 = df2["Emails"].to_list()
In [22]:
len(l1)
Out[22]:
10000
In [23]:
len(l2)
Out[23]:
10000
Create a list of duplicate email addresses along with their index
In [24]:
import collections
# Each record stores the position in l1 and the matching address
Duplicate = collections.namedtuple("Emails", "index email")
tem = []
In [25]:
# Compare every address in l1 with every address in l2
for i in range(len(l1)):
    for j in range(len(l2)):
        if l1[i] == l2[j]:
            tem.append(Duplicate(i, l1[i]))
        
In [29]:
len(tem)
Out[29]:
127
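
The nested loop above makes len(l1) × len(l2) comparisons, which is 100 million for two lists of 10,000 addresses. As an alternative sketch (a different technique from the one in the notebook), a set lookup finds the same addresses in a single pass. The short lists here are stand-ins for l1 and l2. Unlike the nested loop, this records each position in l1 at most once, even when the address occurs several times in l2:

```python
# Small stand-in lists; in the notebook these would be l1 and l2
l1 = ["a@x.com", "b@y.org", "c@z.net", "b@y.org"]
l2 = ["b@y.org", "d@w.io", "c@z.net"]

# Build the set once so each membership test is O(1) on average
in_l2 = set(l2)

# Keep (index, email) pairs from l1 whose address also occurs in l2
common = [(i, email) for i, email in enumerate(l1) if email in in_l2]

print(common)  # [(1, 'b@y.org'), (2, 'c@z.net'), (3, 'b@y.org')]
```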

Create a list with just the indexes of these duplicate values

In [33]:
Index_remove = []

for x in tem:
    Index_remove.append(x.index)
These are the indexes where the duplicate emails are found
In [44]:
Index_remove[0:12]
Out[44]:
[17, 169, 215, 263, 263, 270, 271, 283, 414, 434, 435, 442]
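
Index 263 appears twice in this output because the nested loop appends one record per matching pair, so an address in l1 that matches two rows of l2 is recorded twice. As a small sketch (not part of the original notebook), converting the list to a set drops the repeats and also makes later membership checks faster. The values below are the first few from the output above, and index_set is a name introduced here for illustration:

```python
# First few values from the output above, including the repeated 263
Index_remove = [17, 169, 215, 263, 263, 270]

# A set keeps each index once and gives fast "in" checks
index_set = set(Index_remove)

print(sorted(index_set))  # [17, 169, 215, 263, 270]
```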

Iterate over Email

  • Check whether the email's index exists in Index_remove; if yes, skip it, otherwise append the email to a new list
  • The appended values are your unique emails
In [41]:
newList = []
for c, x1 in enumerate(l1):
    # Keep only addresses whose index was not flagged as a duplicate
    if c not in Index_remove:
        newList.append(x1)
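
Note that newList only holds addresses from the first file that are missing from the second. If "unique between both files" is read as "present in exactly one of the two files", the addresses that appear only in the second file are needed as well. A minimal sketch under that assumption, with stand-in lists in place of l1 and l2 and illustrative variable names:

```python
# Small stand-in lists; in the notebook these would be l1 and l2
l1 = ["a@x.com", "b@y.org", "c@z.net"]
l2 = ["b@y.org", "d@w.io"]

s1, s2 = set(l1), set(l2)

# Addresses of the first file that never occur in the second (what newList holds)
only_in_first = [e for e in l1 if e not in s2]
# Addresses of the second file that never occur in the first
only_in_second = [e for e in l2 if e not in s1]

# Addresses that occur in exactly one of the two files
in_exactly_one = only_in_first + only_in_second
print(in_exactly_one)  # ['a@x.com', 'c@z.net', 'd@w.io']
```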
        
  • Create a pandas DataFrame
In [42]:
myuni = pd.DataFrame(data={
    "UniqueEmail":newList
})
In [43]:
myuni
Out[43]:
UniqueEmail
0 valenciaariel@shelton-peterson.com
1 mark96@andersen.com
2 warrenjames@garrison.info
3 hmitchell@yahoo.com
4 joshuadickerson@gmail.com
... ...
9875 uodom@yahoo.com
9876 jasonprice@gmail.com
9877 jeremy37@hotmail.com
9878 ojohnson@hotmail.com
9879 phillip45@bennett-romero.com

9880 rows × 1 columns
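
The question asks for files, but the notebook stops at the DataFrame. A minimal sketch of the final step using DataFrame.to_csv; the helper save_emails, the column name CommonEmail, and the output file names are assumptions for illustration:

```python
import pandas as pd

def save_emails(emails, column, path):
    """Write a list of addresses to a one-column CSV file without the index."""
    pd.DataFrame({column: emails}).to_csv(path, index=False)

# Hypothetical usage with the notebook's variables:
# save_emails(newList, "UniqueEmail", "unique_emails.csv")
# save_emails([d.email for d in tem], "CommonEmail", "common_emails.csv")
```

Passing index=False keeps the row numbers out of the file, so it contains only the header and the addresses.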
