I'm trying to scrape data from the donedeal.ie advertisement website. I thought I was doing a good job until I realised that some images don't match their ad link and description. I figured out why it is happening, but I can't come up with a solution.
Basically, the div class name changes when the user has not added an image, and the image from the next ad is shifted into that space. By the end of the scrape, my images are shifted by 37 places.
public static void getDoneDealAll() throws IOException
{
    String BASE_URL = "http://ift.tt/2m3laMV";
    int no_of_pages = 3;
    for (int i = 1; i <= no_of_pages; i++)
    {
        String new_url = BASE_URL + p;
        Document d = Jsoup.connect(new_url).timeout(6000).userAgent("Chrome/41.0.2228.0").get();
        Elements ele = d.select("ul.card-collection");
        for (Element element : ele)
        {
            for (Element title : element.select(".card__body-title"))
            {
                //System.out.println(title.text());
                bikesTitles.add(title.text());
            }
            for (Element img : element.select(".card__media img"))
            {
                if (img.select("div:not([scr])").isEmpty())
                {
                    bikesLinks.add("http://ift.tt/2n3b4k2");
                    System.out.println("http://ift.tt/2n3b4k2");
                }
                else
                {
                    bikesLinks.add(img.attr("data-lazy-img") + img.attr("src"));
                    System.out.println(img.attr("data-lazy-img") + img.attr("src"));
                }
            }
            for (Element title : element.select(".card-item a"))
            {
                //System.out.println((title.attr("href")));
                bikesAdLinks.add((title.attr("href")));
                bikesTypes.add("Unknown");
            }
        }
        p = p + 28;
        System.out.println(new_url);
    }
}
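The shift seems to come from filling three lists in three separate loops: an ad without a photo adds a title and a link but no image, so every later image moves up one slot. One possible way around this, sketched below, is to loop over each ad card once and read the title, image and link from that same card, adding a placeholder when no image is found. This is only a sketch under assumptions: it assumes each `.card-item` element wraps exactly one ad, and the `Ad` class, the `PLACEHOLDER` value and the sample HTML in `main` are hypothetical names made up for illustration. It also assumes a reasonably recent jsoup version that provides `selectFirst`.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CardScraperSketch {
    // Hypothetical placeholder stored for ads that have no photo.
    static final String PLACEHOLDER = "no-image";

    // Hypothetical holder that keeps the fields of one ad together.
    static final class Ad {
        final String title;
        final String imageUrl;
        final String href;

        Ad(String title, String imageUrl, String href) {
            this.title = title;
            this.imageUrl = imageUrl;
            this.href = href;
        }
    }

    // Visits each card once, so every ad yields exactly one title, image and link.
    static List<Ad> collect(Document doc) {
        List<Ad> ads = new ArrayList<>();
        // Assumption: one .card-item element per ad.
        for (Element card : doc.select("ul.card-collection .card-item")) {
            Element title = card.selectFirst(".card__body-title");
            Element img = card.selectFirst(".card__media img");
            Element link = card.selectFirst("a[href]");
            ads.add(new Ad(
                    title != null ? title.text() : "",
                    imageUrl(img),
                    link != null ? link.attr("href") : ""));
        }
        return ads;
    }

    // Prefers the lazy-load attribute, then src, then the placeholder.
    static String imageUrl(Element img) {
        if (img == null) {
            return PLACEHOLDER;
        }
        String lazy = img.attr("data-lazy-img");
        if (!lazy.isEmpty()) {
            return lazy;
        }
        String src = img.attr("src");
        return src.isEmpty() ? PLACEHOLDER : src;
    }

    public static void main(String[] args) {
        // Made-up markup: the second ad has no .card__media img at all.
        String html = "<ul class=\"card-collection\">"
                + "<li class=\"card-item\"><a href=\"/ad/1\">"
                + "<span class=\"card__media\"><img src=\"a.jpg\"></span>"
                + "<span class=\"card__body-title\">Bike A</span></a></li>"
                + "<li class=\"card-item\"><a href=\"/ad/2\">"
                + "<span class=\"card__body-title\">Bike B</span></a></li>"
                + "</ul>";
        for (Ad ad : collect(Jsoup.parse(html))) {
            System.out.println(ad.title + " | " + ad.imageUrl + " | " + ad.href);
        }
    }
}
```

If the separate `bikesTitles`, `bikesLinks` and `bikesAdLinks` lists have to stay, the same idea applies: add one entry to each list inside the single per-card loop, so their indexes cannot drift apart.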