Thursday, January 17, 2019

Scraping titles and meta tags from millions of URLs

I have looked at plenty of tools so far, such as Scrapy and Selenium. The question is not how to scrape a single website, but how to scrape millions of websites in a reasonable amount of time while respecting robots.txt and crawling politely.

I have collected over a billion URLs so far, and now I need to fetch each of them to extract the "title" and the meta tags.
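To make the extraction step concrete, here is a minimal sketch of what I mean by fetching "title" and meta tags from one page, using only Python's standard `html.parser`. The names `TitleMetaParser` and `extract_title_and_meta` are made up for illustration, and the choice to key meta tags by their `name` or `property` attribute is my own assumption, not a requirement.

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Collects the <title> text and key -> content pairs from <meta> tags.

    Hypothetical helper: meta tags are keyed by their `name` attribute,
    falling back to `property` (as used by Open Graph tags).
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            key = attr.get("name") or attr.get("property")
            if key and attr.get("content") is not None:
                self.meta[key.lower()] = attr["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title_and_meta(html):
    """Return (title, meta_dict) for an HTML document given as a string."""
    parser = TitleMetaParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.meta
```

Since only the `<head>` is needed, a crawler could presumably stop downloading a page after the first few kilobytes, which matters at this scale.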

Is this possible, and if so, how? Which tool would let me scrape that many URLs without being blocked or banned by a website?
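For reference, this is roughly what I understand by "politeness": check robots.txt before fetching, and never hit the same host more often than some minimum delay. The sketch below uses the standard `urllib.robotparser` module; `build_robots`, `HostThrottle`, the bot name `mybot`, and the one-second default delay are all assumptions of mine for illustration, not an established recipe.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt):
    """Parse the text of a robots.txt file that was already downloaded."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


class HostThrottle:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, default_delay=1.0, clock=time.monotonic):
        self.default_delay = default_delay  # assumed default, in seconds
        self.clock = clock
        self._next_allowed = {}

    def wait_time(self, url, delay=None):
        """Reserve a slot for `url` and return how long to wait before fetching."""
        host = urlsplit(url).netloc.lower()
        now = self.clock()
        ready_at = max(self._next_allowed.get(host, now), now)
        gap = self.default_delay if delay is None else delay
        self._next_allowed[host] = ready_at + gap
        return ready_at - now


# Usage with a made-up robots.txt and the hypothetical user agent "mybot".
robots = build_robots("User-agent: *\nDisallow: /private/\nCrawl-delay: 5")
throttle = HostThrottle()
url = "http://example.com/public/page"
if robots.can_fetch("mybot", url):
    pause = throttle.wait_time(url, delay=robots.crawl_delay("mybot"))
    # A real crawler would sleep for `pause` seconds (or schedule other hosts
    # in the meantime) and then download `url`.
```

With a billion URLs, the per-host delay only stays cheap if requests to different hosts are interleaved, so I assume the real work is in distributing the queue across many hosts and machines.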

Thanks



