web: Scraping with BS4 but HTML gets messed up when parsed

Sunday, March 25, 2018

I'm having trouble scraping a website with BeautifulSoup4 and Python 3. I fetch the HTML with dryscrape because the page requires JavaScript to be enabled before it will display (although, as far as I know, the page itself never actually uses it).

This is my code:

from bs4 import BeautifulSoup
import dryscrape

productUrl = "https://www.mercadona.es/detall_producte.php?id=32009"

# Load the page through dryscrape so that JavaScript is enabled
session = dryscrape.Session()
session.visit(productUrl)
response = session.body()

soup = BeautifulSoup(response, "lxml")
# A string passed as the second argument to find() filters by CSS class
container1 = soup.find("div", "contenido").find("dl").find_all("dt")
container3 = soup.find("div", "contenido").find_all("td")

OK, now I want to read the contents of container3, but:

type(container3)

Returns:

bs4.element.ResultSet

which is the same as type(container1), but its length is 0!
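For context, an empty ResultSet is what find_all() returns whenever nothing matches, so the type alone says nothing about whether the td tags were found. A minimal sketch of that behaviour follows; the markup is a made-up stand-in, not the real page, and it uses the built-in "html.parser" only to avoid extra dependencies:

```python
from bs4 import BeautifulSoup

# Hypothetical markup (NOT the real page): a div with a definition
# list but no table cells at all.
html = '<div class="contenido"><dl><dt>Name</dt><dt>Price</dt></dl></div>'
soup = BeautifulSoup(html, "html.parser")

div = soup.find("div", "contenido")
dts = div.find("dl").find_all("dt")   # two matches
tds = div.find_all("td")              # no matches

# Both results have the same type; only the length differs.
print(type(dts).__name__, len(dts))
print(type(tds).__name__, len(tds))
```

So a length of 0 here most likely means that, in the tree BeautifulSoup built, no td ended up inside the "contenido" div.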

I wanted to see what the enclosing div actually contained before searching it for my tag, so I wrote it to a file:

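container3 = soup.find("div", "contenido")
# "soup.html" is an illustrative file name; the original does not show how soup_file was opened
with open("soup.html", "w", encoding="utf-8") as soup_file:
    soup_file.write(container3.prettify())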

And, here is the link to that file: https://pastebin.com/xc22fefJ

The markup gets mangled just before the table I want to scrape. I can't understand why: when I view the page source in Firefox, everything looks fine.
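One way to narrow this down, assuming the cause is that the page's markup is slightly malformed and the parsers repair it differently (an assumption; nothing above confirms it), is to feed the same HTML to each parser BeautifulSoup supports and compare how many td tags end up inside the div. The helper name count_cells and the sample markup are hypothetical; only the "contenido" class comes from the code above:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_cells(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <td> tags inside div.contenido}."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # This parser is not installed; skip it.
            continue
        div = soup.find("div", "contenido")
        counts[parser] = len(div.find_all("td")) if div is not None else 0
    return counts


# Made-up sample with an unclosed <p>, standing in for the real page.
sample = (
    '<div class="contenido"><p>Intro'
    "<table><tr><td>a</td><td>b</td></tr></table></div>"
)
print(count_cells(sample))
```

On the real page, the string to pass in would be session.body(). If the counts differ between parsers, the difference points at how each one repairs the markup rather than at the find_all() call.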

Could you help me, please?

Thank you very much in advance.
