Tuesday, December 29, 2015

wget -- how to bypass robots.txt when mirroring a website

I'm trying to get a copy of a website, "http://www.xcn.one/", that looks interesting to me. So, as usual, I tried:

wget -r -k -p -np -e robots=off http://www.xcn.one/

But it does not work: all I get is the index.html file.

However, all the articles are accessible from Firefox (normal browsing by clicking the links).

So is there any explanation of what kind of mechanism this site is using to prevent mirroring, and how can I download it? Thanks.
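For reference, the options in the command break down as follows (a sketch based on wget's documented flags; note that the documented way to ignore robots.txt is `-e robots=off`, not `-robots=off`):

```shell
# Annotated mirror command, using wget's documented options:
#   -r             recursive retrieval
#   -k             convert links so the local copy works offline
#   -p             also fetch page requisites (CSS, images, scripts)
#   -np            never ascend to the parent directory
#   -e robots=off  run the wgetrc command "robots = off" (ignore robots.txt)
CMD='wget -r -k -p -np -e robots=off http://www.xcn.one/'
echo "$CMD"
```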



