I have seen plenty of tools so far, such as Scrapy and Selenium. The question is not how to scrape a single website, but how to scrape millions of websites in a reasonable amount of time while respecting robots.txt and general crawling politeness.
I have collected over a billion URLs so far, and now I need to fetch each of them in order to extract the "title" and the meta tags.
Is this possible, and if so, how? Which tool would let me fetch that many URLs without being blocked or banned by the websites?
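To make the per-URL work concrete, here is a minimal sketch using only the Python standard library. It is an illustration of the three pieces involved (robots.txt check, per-host delay, title/meta extraction), not a solution for a billion URLs. The names `TitleMetaParser`, `extract_title_and_meta`, `allowed_by_robots` and `HostThrottle` are hypothetical helpers made up for this example, and fetching, retries and concurrency are deliberately left out.

```python
import time
from html.parser import HTMLParser
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class TitleMetaParser(HTMLParser):
    """Collect the <title> text and <meta name/property=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            a = dict(attrs)
            key = a.get("name") or a.get("property")
            if key and a.get("content") is not None:
                self.meta[key.lower()] = a["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title_and_meta(html):
    """Return (title, meta_dict) for an already-downloaded HTML string."""
    parser = TitleMetaParser()
    parser.feed(html)
    return parser.title.strip(), parser.meta


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file fetched earlier."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)


class HostThrottle:
    """Hypothetical helper: space out requests to the same host by a fixed delay."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait_time(self, url, now=None):
        """Return how long to wait before requesting url, and reserve that slot."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).netloc.lower()
        ready_at = max(self._next_allowed.get(host, now), now)
        self._next_allowed[host] = ready_at + self.delay
        return ready_at - now
```

The throttle is keyed by host so that requests to different sites can proceed in parallel while each individual site only sees one request per delay interval. The robots.txt text would be fetched and cached once per host, not once per URL.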
Thanks