Wednesday, March 25, 2015

Get the hosts of many websites and services

How do I write code to crawl this?


There is a website that looks up which company hosts a particular website or service: http://ift.tt/1ItVL4C http://ift.tt/1CXzOw5


For example, enter fbcdn.net and it returns Facebook; enter paypal.com and it returns eBay.


I have more than 100,000 websites and want to find the corresponding company for each. I'm looking at Jsoup; is it the right tool? I have something like this in mind:



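Map<String, String> companies = new HashMap<>();
for (String website : websiteSet) {
    String url = "http://ift.tt/1ItVL4G" + website;
    Document doc = Jsoup.connect(url).get();
    // Jsoup's Document has no getHost(); the company name has to be selected
    // from the page. "span.company" is a hypothetical selector, not the real one.
    String company = doc.select("span.company").text();
    companies.put(website, company);
}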


Any suggestions? I have heard that the site being crawled might block me if I send too many requests within a few minutes.
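One common way to lower the risk of being blocked is to pause between requests. Below is a minimal sketch of that idea, not a tested crawler: the class name ThrottledLookup, the method lookupAll, and the delay value are all made up for illustration, and the actual page fetch is passed in as a function so the throttling logic can be tried without touching the network. What delay the site actually tolerates is unknown; checking its terms of use or asking for an API would be safer than guessing.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: looks up each website one at a time, waiting a fixed
// delay between consecutive lookups so requests are spread out over time.
final class ThrottledLookup {

    static Map<String, String> lookupAll(Collection<String> websites,
                                         Function<String, String> fetchCompany,
                                         long delayMillis) {
        Map<String, String> result = new LinkedHashMap<>();
        boolean first = true;
        for (String website : websites) {
            if (!first) {
                try {
                    // Pause before every request except the first one.
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting", e);
                }
            }
            first = false;
            result.put(website, fetchCompany.apply(website));
        }
        return result;
    }

    public static void main(String[] args) {
        // Stub fetcher standing in for the real page request, using the two
        // examples from the question. A real fetcher would wrap a Jsoup call
        // and handle its IOException.
        Function<String, String> stub =
                site -> site.equals("fbcdn.net") ? "Facebook" : "eBay";
        Map<String, String> companies =
                lookupAll(Arrays.asList("fbcdn.net", "paypal.com"), stub, 1000L);
        System.out.println(companies);
    }
}
```

With 100,000 websites and a one-second delay the full run would take a bit over a day, so saving results as they arrive (to resume after a failure) is probably worth adding.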




