I have been working with the Heritrix web crawler at my company recently, and after a good deal of searching and testing I still can't find a way to solve our need.
We want to run Heritrix automatically from cron every day to crawl a list of web pages, and check whether any link on those pages points to a site on our list of domains. The difficult part, which I can't find a way to do, is logging the whole trace that leads to each link pointing to one of our domains.
The job's log file stores every link with some information, but not the trace. For example, we could run a script when the job is done that greps for brazzers, which is a domain in the list; if it finds "brazzers" in the crawl log, it should write the result to another log with the whole trace from start to end:
2015-10-25T20:18:58.369Z 200 91 http://ift.tt/1GsVEKt XLEP http://ift.tt/1MODVtx text/plain #021 20151025201857643+726 sha1:CPA63O5POU3CVLCH3VDDIMBJCCWRVLPC - -
Is it possible to do this, or is there another way? I feel very stupid with this stuff and I am not very good at programming.
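One possible approach, sketched here in Python under some assumptions: in a crawl.log line like the one above, the fourth whitespace-separated field appears to be the fetched URI, the fifth the hop path (XLEP), and the sixth the referring URI that the link was found on. If that holds, and if the links to the listed domains actually show up in crawl.log, the trace can be rebuilt by following the referrer field backwards until it reaches a line whose referrer is "-". The function names below are hypothetical and are not part of Heritrix.

```python
from urllib.parse import urlsplit


def parse_crawl_log(lines):
    """Map each URI to (hop_path, via), assuming the field order
    timestamp, status, size, URI, hop path, via, ... (an assumption)."""
    entries = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or truncated lines
        uri, hop_path, via = fields[3], fields[4], fields[5]
        # Keep the first occurrence of a URI if it is logged more than once.
        entries.setdefault(uri, (hop_path, via))
    return entries


def trace_to_seed(uri, entries):
    """Follow the via field backwards and return the chain, seed first."""
    chain = [uri]
    seen = {uri}
    while uri in entries:
        via = entries[uri][1]
        if via == "-" or via in seen:
            break  # reached a seed, or a loop in the referrer chain
        chain.append(via)
        seen.add(via)
        uri = via
    chain.reverse()
    return chain


def matches_domain(uri, domains):
    """True if the URI's host is one of the domains or a subdomain of one."""
    host = urlsplit(uri).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


def find_traces(lines, domains):
    """Return one trace (list of URIs, seed first) per matching logged URI."""
    entries = parse_crawl_log(lines)
    return [
        trace_to_seed(uri, entries)
        for uri in entries
        if matches_domain(uri, domains)
    ]
```

A script run from cron after the job finishes could open the job's crawl.log, pass the file object and the domain list to `find_traces`, and write each returned chain to a separate log, one chain per line.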
Thank you very much in advance
Enrique.