Monday, May 1, 2017

Recursing over all HTML links and indexing them

I am tasked with iterating over all pages of a web portal and indexing them in a hierarchy:

  1. top level: 1.1
  2. under top level: 1.1.1
  3. etc.: 1.1.2
  4. etc.: 1.2.1

My problem is that my code never generates index 1.2.1; it starts that level at 1.2.2 instead. Likewise, it never generates 1.3.1 or 1.3.2 and starts at 1.3.3, and so on. I know where the problem is, but I am out of ideas for how to solve it. I am posting my code below. Thanks, guys.

private void recursiveLinkSearch(String webPage, int actualRecursionDepth, int numberlink, String previousnumberLink) {
    try {
        Document doc = Jsoup.connect(webPage).get();
        uniqueLinks.add(webPage);
        logger.info(webPage);
        // Index of this page: the parent's prefix followed by this page's number.
        pageIndexes.put(webPage, previousnumberLink.concat(String.valueOf(numberlink)));
        // Prefix handed down to the links found on this page.
        String actualNumberLink = previousnumberLink.concat(String.valueOf(numberlink)).concat(".");
        if (getRecursionMode().equals(WebPortalMode.FULL) || actualRecursionDepth < getRecursionMode().getRecursionDepth()) {
            for (Element record : doc.select("a")) {
                String url = record.absUrl("href");
                // Skip links that only point to an element on the same page.
                url = avoidBookMarkedLinks(url);
                if (!uniqueLinks.contains(url)) {
                    // Only recurse into links that belong to the portal's own domain.
                    if (url.contains(getWebPortalDomain())) {
                        recursiveLinkSearch(url, actualRecursionDepth + 1, numberlink, actualNumberLink);
                        numberlink++;
                    }
                }
            }
        }

    } catch (Exception e) {
        // Exceptions (for example, unreachable pages) are silently ignored.
    }
}
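
One possible way around this, offered as a sketch rather than a definitive fix: in the code above, numberlink is both the page's own number and the counter for the links found on that page, so the children of page 1.2 start counting at 2 and the children of 1.3 start at 3. Giving every page its own child counter that starts at 1 avoids that. To keep the example self-contained and runnable without Jsoup, it swaps the real portal for an in-memory map of page to links; the class name LinkIndexer, the site map, and the sample URLs are all hypothetical and not part of the original code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class LinkIndexer {
    /** Page URL -> hierarchical index, kept in visit order. */
    private final Map<String, String> pageIndexes = new LinkedHashMap<>();
    /** Hypothetical stand-in for the portal: page URL -> links found on that page. */
    private final Map<String, List<String>> site;

    LinkIndexer(Map<String, List<String>> site) {
        this.site = site;
    }

    /** Indexes every page reachable from rootPage; the root itself gets index "1". */
    Map<String, String> index(String rootPage) {
        visit(rootPage, "1");
        return pageIndexes;
    }

    private void visit(String page, String pageIndex) {
        pageIndexes.put(page, pageIndex);
        // Local counter: restarts at 1 for every page instead of
        // inheriting the page's own number from the caller.
        int childNumber = 1;
        for (String url : site.getOrDefault(page, List.of())) {
            if (pageIndexes.containsKey(url)) {
                continue; // already indexed, so it does not consume a number
            }
            visit(url, pageIndex + "." + childNumber);
            childNumber++;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> site = Map.of(
                "/", List.of("/a", "/b"),
                "/a", List.of("/a1", "/a2"),
                "/b", List.of("/b1", "/a"));
        new LinkIndexer(site).index("/")
                .forEach((page, idx) -> System.out.println(idx + " " + page));
    }
}
```

In recursiveLinkSearch the equivalent change would presumably be to declare a local counter initialised to 1 just before the for loop, and to pass and increment that counter instead of numberlink. Note that the traversal stays depth-first, so a page linked from several places is numbered under whichever parent reaches it first.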



