web: Fetch a compressed web file and save it to HDFS uncompressed

Sunday, 24 September 2017

In Spark, I wish to take a file like:

and save it to HDFS, extracted as a series of folders and files.

Conceptually this is two steps: A) download the file to HDFS; B) unzip it into a directory and file structure.
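To make step B concrete, here is a minimal sketch in Python, assuming the archive really is a zip file. Zip keeps its index at the end of the file, so the reader needs a seekable file object, which is one reason to land the archive on HDFS first. The names `extract_zip` and `open_output` are hypothetical: `open_output` stands for whatever call your HDFS client offers to create a file for writing, and is not a real library function.

```python
import shutil
import zipfile


def extract_zip(archive, open_output, chunk_size=1024 * 1024):
    """Extract every file in a zip archive, one member at a time.

    archive     -- a seekable binary file object opened on the zip
    open_output -- hypothetical callable: path -> context manager yielding
                   a writable binary file object (e.g. a wrapper around an
                   HDFS client's "create file" call)
    Returns the list of member paths that were written.
    """
    written = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # folders are implied by the file paths
            with zf.open(info) as src, open_output(info.filename) as dst:
                # Copy in fixed-size chunks so memory use stays bounded
                # no matter how large the member is.
                shutil.copyfileobj(src, dst, chunk_size)
            written.append(info.filename)
    return written
```

Member names are used as given here; a real job would probably want to reject names containing `..` or absolute paths before writing.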

For bonus points, parallelize the download or the extraction.
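One possible way to parallelize the extraction, assuming a zip archive already sitting on HDFS, is to split the member list into groups of roughly equal size and let each task extract one group. The sketch below only covers the splitting; `split_members` is a hypothetical helper, and the sizes would come from something like `ZipInfo.file_size`.

```python
import heapq


def split_members(sizes, workers):
    """Assign archive members to workers so the bytes per worker are roughly even.

    sizes   -- dict mapping member name -> uncompressed size in bytes
    workers -- number of parallel tasks
    Returns a list of `workers` lists of member names.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    groups = [[] for _ in range(workers)]
    # Heap of (bytes assigned so far, group index); lightest group on top.
    heap = [(0, i) for i in range(workers)]
    # Greedy rule: take members largest first, give each to the lightest group.
    for name, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)
        groups[i].append(name)
        heapq.heappush(heap, (load + size, i))
    return groups


groups = split_members({"a": 50, "b": 30, "c": 20, "d": 10}, workers=2)
print(groups)  # [['a', 'd'], ['b', 'c']]
```

In Spark the groups could then be distributed with something like `sc.parallelize(groups, len(groups)).foreach(extract_group)`, where `extract_group` is an assumed function that opens the archive on HDFS itself and extracts only the names it was given.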

Note that a large enough file may exhaust the memory of the Spark instance.

Note that there may be little temporary space on the Spark worker.
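Both notes point toward streaming the download rather than buffering it. A rough sketch, assuming the destination is an already-open writable stream on HDFS (how that stream is obtained depends on the client library and is left out): the response body is copied in fixed-size chunks, so neither the worker's memory nor its local disk ever holds the whole file. `copy_stream` and `download` are illustrative names, not part of any library.

```python
import urllib.request


def copy_stream(src, dst, chunk_size=4 * 1024 * 1024):
    """Copy src to dst in fixed-size chunks; return the number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total


def download(url, dst, chunk_size=4 * 1024 * 1024):
    """Stream an HTTP(S) response body straight into dst.

    dst is assumed to be a writable binary stream, e.g. an open HDFS file.
    """
    with urllib.request.urlopen(url) as response:
        return copy_stream(response, dst, chunk_size)
```

At most one chunk (4 MiB by default here) is held in memory at a time, whatever the size of the file.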
