Saturday, March 21, 2020

How to Remove Duplicate Email Addresses from Two CSV Files Using Python | Algorithm

EmailsMaster

Soumil Nitin Shah

Bachelor in Electronic Engineering | Master's in Electrical Engineering | Master's in Computer Engineering

Question:

Your company gave you two CSV files with more than 10,000 email addresses. They want you to send back one file containing the email addresses common to both files, and another file containing the email addresses that are unique between the two CSV files. You cannot use Excel; use Python to solve this challenge.

In [10]:
df1.head(4)
Out[10]:
Emails
0 linjason@gmail.com
1 howellian@heath.biz
2 angelaharris@sloan-marshall.net
3 ngilbert@jones.biz
In [13]:
df2.head(4)
Out[13]:
Emails
0 osimmons@yahoo.com
1 martinsimon@gmail.com
2 yferrell@smith.info
3 brittany18@hotmail.com
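The notebook does not show how df1 and df2 were created. A minimal sketch of one way to load them, assuming each CSV has a single Emails column as in the previews above. The in-memory strings stand in for the real files, and the file name mentioned in the comment is hypothetical:

```python
import io
import pandas as pd

# Stand-ins for the two CSV files, reusing a few rows from the previews above.
# With real files, pass a path instead, e.g. pd.read_csv("first.csv"),
# where "first.csv" is a hypothetical file name.
csv_one = io.StringIO("Emails\nlinjason@gmail.com\nhowellian@heath.biz\n")
csv_two = io.StringIO("Emails\nosimmons@yahoo.com\nmartinsimon@gmail.com\n")

df1 = pd.read_csv(csv_one)
df2 = pd.read_csv(csv_two)

print(df1.head(4))
print(df2.head(4))
```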

Algorithms and Steps

  • Convert the email column of each DataFrame into a list
In [35]:
l1 = df1["Emails"].to_list()
l2 = df2["Emails"].to_list()
In [22]:
len(l1)
Out[22]:
10000
In [23]:
len(l2)
Out[23]:
10000
Create a list of duplicate email addresses along with their index
In [24]:
import collections
# Each match stores the position in l1 and the email itself
Duplicate = collections.namedtuple("Emails", "index email")
tem = []
In [25]:
for i in range(len(l1)):
    for j in range(len(l2)):
        # Record the position in l1 each time its email also appears in l2
        if l1[i] == l2[j]:
            tem.append(Duplicate(i, l1[i]))
In [29]:
len(tem)
Out[29]:
127
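The nested loops compare every pair of emails, which is about 100 million comparisons for two lists of 10,000. As an alternative sketch (not the notebook's method), a set built from the second list lets each email in the first list be checked once. One difference to be aware of: this records each position in the first list at most once, whereas the nested loops add one entry for every matching row in the second list. The short lists below are made-up sample data:

```python
import collections

Duplicate = collections.namedtuple("Emails", "index email")

# Made-up sample data standing in for l1 and l2
l1 = ["a@x.com", "b@x.com", "c@x.com", "d@x.com"]
l2 = ["c@x.com", "z@x.com", "a@x.com", "a@x.com"]

# Looking up a value in a set takes constant time on average
seen_in_l2 = set(l2)

# Keep the position and value of every l1 email that also occurs in l2
tem = [Duplicate(i, email) for i, email in enumerate(l1) if email in seen_in_l2]

print(len(tem))
print([d.index for d in tem])
```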

Create a list with just the indexes where these duplicate values are

In [33]:
Index_remove = [x.index for x in tem]
These are the indexes in l1 where duplicate emails were found
In [44]:
Index_remove[0:12]
Out[44]:
[17, 169, 215, 263, 263, 270, 271, 283, 414, 434, 435, 442]
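Index 263 appears twice in the output above because the nested loops add one entry for every matching row in the second list. Membership checks still work with the repeat in place, but converting the list to a set removes it and makes lookups faster. A small sketch using only the twelve values shown above (the name index_remove_set is introduced here for illustration):

```python
# The first twelve indexes from the output above
Index_remove = [17, 169, 215, 263, 263, 270, 271, 283, 414, 434, 435, 442]

# A set drops the repeated 263 and supports fast membership checks
index_remove_set = set(Index_remove)

print(len(Index_remove), len(index_remove_set))  # 12 11
print(263 in index_remove_set)  # True
```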

Iterate over the emails

  • Check whether the email's index exists in Index_remove; if yes, skip it, otherwise append it
  • The appended values are your unique emails
In [41]:
newList = []
for c, x1 in enumerate(l1):
    # Keep only emails whose position was not flagged as a duplicate
    if c not in Index_remove:
        newList.append(x1)
        
  • Create a pandas DataFrame
In [42]:
import pandas as pd

myuni = pd.DataFrame(data={"UniqueEmail": newList})
In [43]:
myuni
Out[43]:
UniqueEmail
0 valenciaariel@shelton-peterson.com
1 mark96@andersen.com
2 warrenjames@garrison.info
3 hmitchell@yahoo.com
4 joshuadickerson@gmail.com
... ...
9875 uodom@yahoo.com
9876 jasonprice@gmail.com
9877 jeremy37@hotmail.com
9878 ojohnson@hotmail.com
9879 phillip45@bennett-romero.com

9880 rows × 1 columns
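The question asks for two files, one with the common emails and one with the unique emails. A minimal sketch of how both could be produced directly in pandas, assuming "unique" means emails from the first file that never appear in the second, which is what the steps above compute. The sample data and the output file names are made up for illustration:

```python
import pandas as pd

# Made-up sample data standing in for the two email columns
df1 = pd.DataFrame({"Emails": ["a@x.com", "b@x.com", "c@x.com"]})
df2 = pd.DataFrame({"Emails": ["c@x.com", "d@x.com", "a@x.com"]})

# True for each row of df1 whose email also appears in df2
in_both = df1["Emails"].isin(df2["Emails"])

# Emails present in both files, listed once each
common = df1.loc[in_both, "Emails"].drop_duplicates()

# Emails from the first file that never appear in the second
unique_to_first = df1.loc[~in_both, "Emails"]

# Hypothetical output file names
common.to_frame("CommonEmail").to_csv("common_emails.csv", index=False)
unique_to_first.to_frame("UniqueEmail").to_csv("unique_emails.csv", index=False)
```

If "unique" should instead cover emails that appear in only one of the two files, the same isin check can be repeated with df1 and df2 swapped and the two results combined.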
