Saturday, June 1, 2019

How to scrape only the article text from a web page using "p" and "div" tags

I have some issues with my scraper. I don't know how to scrape only the article text, without service text such as "Log in", "edit", and so on. I'm building a summarization tool for my diploma project, and it scrapes articles from web pages and files.

First, I tried scraping only "p" tags, but that doesn't work correctly on some sites, because they don't put the article text in "p" tags. Then I tried scraping both "p" and "div" tags. But now it scrapes absolutely everything, including things I don't want in the text to be summarized.

Here is the module I use for scraping text and articles from a web page:

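import urllib.error
import urllib.request

import bs4 as bs


def scrap(url):
    try:
        # Send a browser-like User-Agent, since some sites reject the default one.
        req = urllib.request.Request(url, headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        })
        # The network call is what usually fails, so it belongs inside the try block.
        with urllib.request.urlopen(req) as scraped_data:
            article = scraped_data.read()
    except (urllib.error.URLError, ValueError):
        return False

    parsed_article = bs.BeautifulSoup(article, 'lxml')

    paragraphs = parsed_article.find_all('p')
    text_in_no_p = parsed_article.find_all("div")

    article_text = ""

    # Text from every "p" tag.
    for p in paragraphs:
        article_text += p.text

    # Text from every "div" tag; this is the step that pulls in the service text.
    for div in text_in_no_p:
        article_text += div.text

    return article_text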
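
A small check, not part of the module itself, of why the "div" loop ends up collecting everything: find_all("div") returns the outer block and every nested block, and the text of an outer block already contains the text of its children. The HTML snippet and its ids are made up for illustration, and the built-in "html.parser" is used here instead of 'lxml' only so the snippet needs nothing beyond BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a menu block and a content block inside one outer block.
html = """
<div id="page">
  <div id="menu">Log in</div>
  <div id="content"><p>Article text.</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each nested div is returned separately, and the outer div repeats their text.
div_texts = [d.get_text(" ", strip=True) for d in soup.find_all("div")]
p_texts = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

# Count how often the article sentence appears once "p" and "div" text are combined.
copies = sum(t.count("Article text.") for t in p_texts + div_texts)

print(div_texts)
print(copies)
```

So with nested blocks the same sentence is collected several times, together with the menu text from the sibling block.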

I expect to get the scraped text without any service text like the examples above, or at least only the text from the main block of the web page. I'm not very experienced with web scraping and know only a few methods. Thank you for any advice.
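
One possible direction, as a minimal sketch under assumptions rather than a definitive fix: drop the tags that usually hold service text, then read paragraphs only from the main container. The function name extract_main_text, the SERVICE_TAGS list, and the sample HTML are hypothetical; real sites may need different tags or class-based selection.

```python
from bs4 import BeautifulSoup

# Assumption: these tags usually hold navigation, scripts, or other service text.
SERVICE_TAGS = ["script", "style", "nav", "header", "footer", "aside", "form", "button"]


def extract_main_text(html):
    """Return text from the main block of a page, skipping common service tags."""
    soup = BeautifulSoup(html, "html.parser")

    # Remove service tags one at a time; searching again after each removal
    # avoids touching a tag that was already destroyed with its parent.
    tag = soup.find(SERVICE_TAGS)
    while tag is not None:
        tag.decompose()
        tag = soup.find(SERVICE_TAGS)

    # Prefer a semantic container if the page has one, else fall back to the whole page.
    root = soup.find("article") or soup.find("main") or soup.body or soup

    # Collect non-empty paragraphs with whitespace normalized.
    paragraphs = [" ".join(p.get_text().split()) for p in root.find_all("p")]
    paragraphs = [text for text in paragraphs if text]
    if paragraphs:
        return "\n".join(paragraphs)

    # No "p" tags at all: take the remaining text of the container once,
    # instead of looping over every "div".
    return " ".join(root.get_text(" ").split())


# Hypothetical sample page with service text around the article.
sample = """
<html><body>
<nav><a href="/login">Log in</a></nav>
<article><p>First paragraph.</p><p>Second paragraph.</p></article>
<footer><a href="/edit">edit</a></footer>
</body></html>
"""

print(extract_main_text(sample))
```

In scrap(), this would replace the two find_all loops, for example by returning extract_main_text(article). It is a heuristic: pages that keep the article in an unmarked "div" next to unmarked menus will still need site-specific handling.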



