The Question: Is there a way to extract the body text of a webpage the way a user would see it? For example, on Wikipedia this would be the main article, and on a news site it would be the news story.
What I've tried: I wrote a program that looks at the HTML and extracts the largest contiguous block of text. This usually doesn't work, though, because on many websites the paragraphs are split across separate <div> and <p> tags and are often interspersed with hyperlinks.
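To make the idea concrete, here is a minimal sketch of a "largest block" heuristic that is a bit more tolerant of the problem described above. It is an illustration, not the exact program I tried. It assumes Python's standard html.parser module, and it assumes that text inside <p> and <a> should count toward the nearest enclosing container, so paragraphs and links do not split a block. The names CONTAINERS, BlockScorer, and extract_main_text are made up for this example.

```python
from html.parser import HTMLParser

# Tags treated as block containers (an assumption made for this sketch).
CONTAINERS = {"div", "article", "section", "main", "td", "body"}
# Tags whose contents are never visible body text.
SKIP = {"script", "style", "noscript"}


class BlockScorer(HTMLParser):
    """Collects visible text per container element."""

    def __init__(self):
        super().__init__()
        self.stack = []      # indices into self.blocks for open containers
        self.blocks = []     # one list of text fragments per container
        self.skip_depth = 0  # > 0 while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self.blocks.append([])
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in CONTAINERS and self.stack:
            # Simplification: assumes containers are properly nested.
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self.skip_depth and self.stack:
            # <p>, <a>, etc. are not containers, so their text lands
            # in the nearest enclosing container.
            self.blocks[self.stack[-1]].append(text)


def extract_main_text(html):
    """Return the text of the container holding the most characters."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    best = max(parser.blocks, key=lambda b: sum(len(t) for t in b), default=[])
    return " ".join(best)
```

Calling extract_main_text(page_source) on a page whose article sits in one <div> returns that div's paragraphs joined into a single string, with link text kept inline. It still fails when the article is spread over several sibling containers, and joining fragments with spaces can leave a stray space before punctuation that follows a link.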
Potential leads? When you highlight text in a browser, you can often start at the beginning of the body and drag down to the end, which usually selects the relevant piece. There may also be a way to match formatting and group blocks of text that way, since body paragraphs often share the same font, size, color, etc.
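One rough way to try the formatting idea without rendering the page is sketched below. It assumes that elements with the same tag name and class attribute are styled the same way, which is only a proxy: the real font, size, and color come from the computed CSS, which this code never looks at. It uses Python's standard html.parser module, and the names StyleGrouper and largest_style_group are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements with no closing tag; they are never pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# Inline elements are assumed to share the look of their parent.
INLINE = {"a", "b", "i", "em", "strong", "span"}
# Elements whose text is never shown to the reader.
HIDDEN = {"script", "style", "noscript"}


class StyleGrouper(HTMLParser):
    """Groups text by a (tag, class) signature as a stand-in for styling."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # (tag, signature) of open elements
        self.groups = defaultdict(list)  # signature -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag in INLINE and self.stack:
            sig = self.stack[-1][1]      # inherit the parent's signature
        else:
            sig = (tag, dict(attrs).get("class") or "")
        self.stack.append((tag, sig))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.stack and self.stack[-1][0] not in HIDDEN:
            self.groups[self.stack[-1][1]].append(text)


def largest_style_group(html):
    """Return the text of the signature group with the most characters."""
    parser = StyleGrouper()
    parser.feed(html)
    parser.close()
    if not parser.groups:
        return ""
    best = max(parser.groups.values(), key=lambda parts: sum(map(len, parts)))
    return " ".join(best)
```

Unlike a per-container approach, this collects paragraphs that look alike even when other elements sit between them. The trade-off is that any other text sharing the same tag and class, such as a comment section, gets pulled in too.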
I'm grateful for any and all help I get, and thanks a lot!