I am trying to understand how monitoring websites for changes works and what concepts are used behind it. I can think of creating a crawler that crawls the whole website, compares each crawled page to the one stored in the database, and overwrites the old HTML if the page has been updated, or stores the page if it doesn't exist yet. So here are my questions:

1- How can I compare two webpages to tell whether they are the same? Do I need to compare the string equivalents of the webpages character by character? (A rough sketch of what I have in mind is below.)

2- Do I need to crawl the whole website? Suppose the HTML pages of a website are 5 GB in size and I want to detect changes on an hourly basis; crawling and downloading 5 GB of data every hour is going to eat up a lot of bandwidth.
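To make question 1 concrete, this is the kind of comparison I have in mind: hashing each page and comparing fingerprints instead of comparing the HTML character by character. The function names are just placeholders, and I don't know whether this is the usual practice, which is part of what I'm asking.

```python
# Rough sketch for question 1 (not necessarily the standard approach):
# fingerprint each page with a hash and compare fingerprints instead of
# comparing the HTML character by character.
import hashlib

def page_fingerprint(html: str) -> str:
    """Return a fixed-size fingerprint of the page's HTML."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def has_changed(new_html: str, stored_fingerprint: str) -> bool:
    """True if the freshly crawled page differs from the stored copy."""
    return page_fingerprint(new_html) != stored_fingerprint
```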
I can write the code myself; I just want to know the general practice used for monitoring a website.
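For question 2, what I am unsure about is whether I can avoid re-downloading pages that haven't changed. Something like a conditional HTTP request (If-None-Match / If-Modified-Since) is what I imagine, assuming the server actually sends ETag or Last-Modified headers; is that the standard approach?

```python
# Sketch for question 2, assuming the server supports conditional requests
# (it may not): only download the body when the page has actually changed.
from typing import Optional, Tuple
import requests

def fetch_if_changed(url: str,
                     etag: Optional[str],
                     last_modified: Optional[str]) -> Optional[Tuple[str, Optional[str], Optional[str]]]:
    """Return (html, new_etag, new_last_modified), or None if unchanged."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    response = requests.get(url, headers=headers, timeout=30)
    if response.status_code == 304:   # 304 Not Modified: body not re-sent
        return None
    return (response.text,
            response.headers.get("ETag"),
            response.headers.get("Last-Modified"))
```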
Thanks a lot.