Saturday, March 21, 2020

Which is the best approach for extracting the text from a web page?

    import java.io.IOException;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Method of a class that declares the String fields webPageText and htmlCode
    public void read(String matchLink) {
        try {
            // Prepare a connection to the web page
            Connection conn = Jsoup.connect(matchLink);

            // Execute the GET request and parse the response
            Document doc = conn.get();

            // Store the visible text and the HTML of the page body
            this.webPageText = doc.body().text();
            this.htmlCode = doc.body().html();
        } catch (IOException ex) {
            System.out.println(ex + " in read method");
        }
    }

I've been using the Document.body().text() method to get the raw text of a web page, but it does not work on some pages. Can you recommend a different approach? I've searched Google but couldn't find a reliable method. Thank you.



