My question is: how do I write code to crawl this?
There is a website that lets you look up which company hosts a particular website or service: http://ift.tt/1ItVL4C http://ift.tt/1CXzOw5
For example, entering fbcdn.net returns Facebook, and entering paypal.com returns eBay.
I have more than 100,000 websites and want to find the corresponding company for each one. I'm currently looking at Jsoup. Is it the right tool? I was thinking of something like this:
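import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
Map<String, String> companies = new HashMap<>();
for (String website : websiteSet) {
    String url = "http://ift.tt/1ItVL4G" + website;
    Document doc = Jsoup.connect(url).get();
    // Document has no getHost(); "COMPANY_SELECTOR" is a placeholder for the page's real markup.
    String company = doc.select("COMPANY_SELECTOR").text();
    companies.put(website, company);
}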
Any suggestions? I have heard that the site being crawled might block my requests if I send too many within a few minutes.
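To make the blocking concern concrete, here is a minimal sketch of what I imagine throttling would look like, assuming a fixed pause between requests is enough to stay under whatever limit the site enforces (I have not verified that it is). The names `ThrottledLookup`, `lookupAll`, and `fetchCompany` are my own, not part of Jsoup; `fetchCompany` stands in for the Jsoup call that loads the page and extracts the company name.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class ThrottledLookup {

    /**
     * Looks up each website sequentially, pausing delayMillis between requests.
     * fetchCompany is a hypothetical hook: given a website, it returns the company name
     * (in real use it would wrap the Jsoup request and the page parsing).
     */
    static Map<String, String> lookupAll(Collection<String> websites,
                                         Function<String, String> fetchCompany,
                                         long delayMillis) {
        Map<String, String> result = new LinkedHashMap<>();
        boolean first = true;
        for (String website : websites) {
            if (!first) {
                try {
                    // Pause before every request except the first one.
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and return what has been collected so far.
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            first = false;
            try {
                result.put(website, fetchCompany.apply(website));
            } catch (RuntimeException e) {
                // One failed lookup should not abort the whole run; record it as null.
                result.put(website, null);
            }
        }
        return result;
    }
}
```

Calling `ThrottledLookup.lookupAll(websiteSet, site -> ..., 1000L)` would space requests one second apart. At that rate 100,000 lookups take about 28 hours, so the delay is a trade-off between speed and the risk of being blocked.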