Saturday, March 14, 2020

Reading CSV | JSON > 5gb as Pandas Dataframe¶

Untitled1

Reading CSV | JSON > 5gb as Pandas Dataframe

In [1]:
import pandas as pd
import os
import sys
In [2]:
for x in os.listdir():
    print(x)
.ipynb_checkpoints
AWS SOUMIL CREDS
Code
converted-keys.ppk
netflix_titles.csv
puttygen (1).exe
Untitled.ipynb
Untitled1.ipynb
ytdtest.pem
In [11]:
ChunkSize = 10
for chunk in pd.read_csv("netflix_titles.csv", chunksize=ChunkSize):
    print(chunk.shape)
    print("="*66)
    print(chunk.head(2))
    print("="*66)
    break
(10, 12)
==================================================================
    show_id   type                                    title  \
0  81145628  Movie  Norm of the North: King Sized Adventure   
1  80117401  Movie               Jandino: Whatever it Takes   

                   director  \
0  Richard Finn, Tim Maltby   
1                       NaN   

                                                cast  \
0  Alan Marriott, Andrew Toth, Brian Dobson, Cole...   
1                                   Jandino Asporaat   

                                    country         date_added  release_year  \
0  United States, India, South Korea, China  September 9, 2019          2019   
1                            United Kingdom  September 9, 2016          2016   

  rating duration                           listed_in  \
0  TV-PG   90 min  Children & Family Movies, Comedies   
1  TV-MA   94 min                     Stand-Up Comedy   

                                         description  
0  Before planning an awesome wedding for his gra...  
1  Jandino Asporaat riffs on the challenges of ra...  
==================================================================
In [12]:
MyList = []
ChunkSize = 10
for chunk in pd.read_csv("netflix_titles.csv", chunksize=ChunkSize):
    MyList.append(chunk)

What we did

  • [df1, df2, df3 ......, dfN]

  • we had 6234/Chunk size

  • 624
In [13]:
len(MyList)
Out[13]:
624
In [15]:
df = pd.read_csv("netflix_titles.csv")
df.shape
Out[15]:
(6234, 12)

Let us see one of the chunk

In [17]:
MyList[0]
Out[17]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh Fernando Lebrija Nesta Cooper, Kate Walsh, John Michael Higgins... United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...
5 80163890 TV Show Apaches NaN Alberto Ammann, Eloy Azorín, Verónica Echegui,... Spain September 8, 2017 2016 TV-MA 1 Season Crime TV Shows, International TV Shows, Spanis... A young journalist is forced into a life of cr...
6 70304989 Movie Automata Gabe Ibáñez Antonio Banderas, Dylan McDermott, Melanie Gri... Bulgaria, United States, Spain, Canada September 8, 2017 2014 R 110 min International Movies, Sci-Fi & Fantasy, Thrillers In a dystopian future, an insurance adjuster f...
7 80164077 Movie Fabrizio Copano: Solo pienso en mi Rodrigo Toro, Francisco Schultz Fabrizio Copano Chile September 8, 2017 2017 TV-MA 60 min Stand-Up Comedy Fabrizio Copano takes audience participation t...
8 80117902 TV Show Fire Chasers NaN NaN United States September 8, 2017 2017 TV-MA 1 Season Docuseries, Science & Nature TV As California's 2016 fire season rages, brave ...
9 70304990 Movie Good People Henrik Ruben Genz James Franco, Kate Hudson, Tom Wilkinson, Omar... United States, United Kingdom, Denmark, Sweden September 8, 2017 2014 R 90 min Action & Adventure, Thrillers A struggling couple can't believe their luck w...

Combining Chunks

What we did

  • [df1, df2, df3 ......, dfN]

  • we had 6234/Chunk size

  • 624
In [18]:
df1 = pd.concat(MyList, axis=0)
In [20]:
df1.shape
Out[20]:
(6234, 12)
In [21]:
df1.head()
Out[21]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 81145628 Movie Norm of the North: King Sized Adventure Richard Finn, Tim Maltby Alan Marriott, Andrew Toth, Brian Dobson, Cole... United States, India, South Korea, China September 9, 2019 2019 TV-PG 90 min Children & Family Movies, Comedies Before planning an awesome wedding for his gra...
1 80117401 Movie Jandino: Whatever it Takes NaN Jandino Asporaat United Kingdom September 9, 2016 2016 TV-MA 94 min Stand-Up Comedy Jandino Asporaat riffs on the challenges of ra...
2 70234439 TV Show Transformers Prime NaN Peter Cullen, Sumalee Montano, Frank Welker, J... United States September 8, 2018 2013 TV-Y7-FV 1 Season Kids' TV With the help of three human allies, the Autob...
3 80058654 TV Show Transformers: Robots in Disguise NaN Will Friedle, Darren Criss, Constance Zimmer, ... United States September 8, 2018 2016 TV-Y7 1 Season Kids' TV When a prison ship crash unleashes hundreds of...
4 80125979 Movie #realityhigh Fernando Lebrija Nesta Cooper, Kate Walsh, John Michael Higgins... United States September 8, 2017 2017 TV-14 99 min Comedies When nerdy high schooler Dani finally attracts...
In [ ]:
 

2 comments:

  1. This is so helpful. Thank you!

    ReplyDelete
  2. This is not working in pycharm. i have a 150 MB file nd the pycharm is not creating its dataframe even after reading chunkwise

    ReplyDelete

Learn How to Read Hudi Tables on S3 Locally in Your PySpark Environment | Essential Packages You Need to Use

sprksql-2 Learn How to Read Hudi Tables on S3 Locally in Your PySpark Environment |...