Sunday, 24 September 2017

Fetch a compressed web file and save it to HDFS uncompressed

In Spark, I want to take a file like:

http://ift.tt/2hs2W7l

and save it to HDFS, extracted as a series of folders and files.

Conceptually this is two steps: (A) download the file to HDFS, and (B) unzip it into a directory and file structure.

For bonus points, parallelize the download or the extraction.

Note that if the file is big enough, it may exhaust the memory of the Spark instance.

Note that there may be little temporary space on the Spark worker.
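
To make the memory and temporary-space constraints concrete, here is a minimal sketch of one way to extract while downloading, so the archive is never held whole in memory or staged on local disk. It assumes the archive is a gzip-compressed tar, which the question does not state; a ZIP file keeps its index at the end and generally cannot be read strictly sequentially like this. The function name `extract_tar_gz_stream` and the `open_output` callback are hypothetical: `open_output` stands in for whatever HDFS client would open a file for writing.

```python
import io
import shutil
import tarfile
from typing import BinaryIO, Callable, List

CHUNK_SIZE = 1024 * 1024  # copy 1 MiB at a time so memory use stays bounded


def extract_tar_gz_stream(source: BinaryIO,
                          open_output: Callable[[str], BinaryIO]) -> List[str]:
    """Extract a .tar.gz stream entry by entry without buffering the archive.

    `source` is any readable binary stream (for example an HTTP response).
    `open_output` is a hypothetical callback that returns a writable binary
    stream for a relative path; in a real job it would wrap an HDFS client.
    Returns the list of file paths written, in archive order.
    """
    written = []
    # Mode "r|gz" reads the archive strictly sequentially, so `source`
    # only needs read(); it does not need to support seek().
    with tarfile.open(fileobj=source, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # directories are implied by the file paths
            data = archive.extractfile(member)
            with open_output(member.name) as target:
                shutil.copyfileobj(data, target, CHUNK_SIZE)
            written.append(member.name)
    return written
```

With the standard library, the `source` argument could be the response object returned by `urllib.request.urlopen(url)`, since stream mode only calls `read()` on it. Archive entry names are used as given here; real code would probably want to reject absolute paths and `..` components before writing.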



