
I have a number of .zip files on S3 that I want to process and extract some data from. Each zip file contains a single JSON file. In Spark we can read .gz files, but I haven't found any way to read the data inside .zip files. Can someone please help me out with how to process large zip files over Spark using Python? I came across some options like newAPIHadoopFile, but didn't get any luck with them, nor did I find a way to implement them in PySpark. Please note the zip files are >1 GB, and some are 20 GB as well.
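For reference, this is the rough direction I have in mind: a minimal sketch, assuming each archive fits in a single executor's memory (the bucket path below is a placeholder), that reads every archive whole with binaryFiles and unpacks it with Python's zipfile module.

    import io
    import zipfile

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("zip-json").getOrCreate()
    sc = spark.sparkContext

    def read_zip_json(pair):
        """Unpack one in-memory zip archive and yield its JSON content as text."""
        path, content = pair
        with zipfile.ZipFile(io.BytesIO(content)) as zf:
            for name in zf.namelist():
                yield zf.read(name).decode("utf-8")

    # binaryFiles yields (path, bytes) pairs; each whole archive is held in one
    # executor's memory, which is where the >1 GB sizes become a concern.
    json_rdd = sc.binaryFiles("s3://my-bucket/archives/*.zip").flatMap(read_zip_json)
    # Each element is now the full text of one JSON file, ready for json.loads
    # or spark.read.json, depending on the file's layout.

Is this a workable approach for files of this size, or is there a better way?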

Sandie
  • Do you have enough RAM to load the compressed archive? Let alone the _uncompressed_ JSON... Why do you need to load it into memory? Why couldn't you store it on disk temporarily? – Attie Mar 28 '19 at 13:07
  • RAM is not a problem for me; I'm using EMR and can use a bigger instance type with RAM > 50 GB. But can you please let me know how to do this... and what do you mean by storing it temporarily on disk? Do I need to un-compress the zip file first? Request you to post some sample. – Sandie Mar 28 '19 at 13:13
  • Programming questions belong on [so]. You should show your working so far and where you are having trouble. – Mokubai Mar 28 '19 at 14:20
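Edit: if the "store it on disk temporarily" suggestion from the comments means staging each archive locally before Spark touches it, I imagine something like the sketch below. This assumes boto3 is available on the EMR node; the bucket, key, and paths are placeholders, and on a multi-node cluster the extracted JSON would need to land somewhere all executors can read, such as HDFS or S3.

    import zipfile

    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("zip-stage").getOrCreate()

    # Download the archive to local disk instead of holding it in memory
    # (bucket and key are placeholders).
    s3 = boto3.client("s3")
    s3.download_file("my-bucket", "archives/data.zip", "/tmp/data.zip")

    # Uncompress to disk; the archive holds a single JSON file.
    with zipfile.ZipFile("/tmp/data.zip") as zf:
        zf.extractall("/tmp/extracted")

    # file:// paths are only visible on the node that did the extraction,
    # so on a real cluster the JSON should be pushed to HDFS or S3 first.
    df = spark.read.json("file:///tmp/extracted/*.json")
    df.printSchema()

Is this the intended approach, and does it scale to the 20 GB archives?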

0 Answers