
I have a number of .zip files on S3 that I want to process and extract some data from. Each zip file contains a single JSON file. In Spark we can read .gz files, but I haven't found any way to read the data inside .zip files. Can someone please help me out with how to process large zip files over Spark using Python? I came across some options like newAPIHadoopFile, but didn't get any luck with them, nor did I find a way to implement them in PySpark. Please note the zip files are >1 GB, and some are 20 GB as well.
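For reference, this is the rough direction I have in mind: a minimal sketch, assuming each archive fits in a single executor's memory (the bucket path below is a placeholder), that reads every archive whole with binaryFiles and unpacks it with Python's zipfile module.

    import io
    import zipfile

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("zip-json").getOrCreate()
    sc = spark.sparkContext

    def read_zip_json(pair):
        """Unpack one in-memory zip archive and yield its JSON content as text."""
        path, content = pair
        with zipfile.ZipFile(io.BytesIO(content)) as zf:
            for name in zf.namelist():
                yield zf.read(name).decode("utf-8")

    # binaryFiles yields (path, bytes) pairs; each whole archive is held in one
    # executor's memory, which is where the >1 GB sizes become a concern.
    json_rdd = sc.binaryFiles("s3://my-bucket/archives/*.zip").flatMap(read_zip_json)
    # Each element is now the full text of one JSON file, ready for json.loads
    # or spark.read.json, depending on the file's layout.

Is this a workable approach for files of this size, or is there a better way?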

Sandie
  • Do you have enough RAM to load the compressed archive? Let alone the _uncompressed_ JSON... Why do you need to load it into memory? Why couldn't you store it on disk temporarily? – Attie Mar 28 '19 at 13:07
  • RAM is not a problem for me; I'm using EMR and can use a bigger instance type with RAM > 50 GB. But can you please let me know how to do this... and what do you mean by storing it temporarily on disk? Do I need to un-compress the zip file first? Request you to post some sample. – Sandie Mar 28 '19 at 13:13
  • Programming questions belong on [so]. You should show your working so far and where you are having trouble. – Mokubai Mar 28 '19 at 14:20
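Edit: if the "store it on disk temporarily" suggestion from the comments means staging each archive locally before Spark touches it, I imagine something like the sketch below. This assumes boto3 is available on the EMR node; the bucket, key, and paths are placeholders, and on a multi-node cluster the extracted JSON would need to land somewhere all executors can read, such as HDFS or S3.

    import zipfile

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("zip-stage").getOrCreate()

    # Download the archive to local disk instead of holding it in memory
    # (bucket and key are placeholders).
    s3 = boto3.client("s3")
    s3.download_file("my-bucket", "archives/data.zip", "/tmp/data.zip")

    # Uncompress to disk; the archive holds a single JSON file.
    with zipfile.ZipFile("/tmp/data.zip") as zf:
        zf.extractall("/tmp/extracted")

    # file:// paths are only visible on the node that did the extraction,
    # so on a real cluster the JSON should be pushed to HDFS or S3 first.
    df = spark.read.json("file:///tmp/extracted/*.json")
    df.printSchema()

Is this the intended approach, and does it scale to the 20 GB archives?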

0 Answers