Sunday, March 25, 2018

Scraping with BS4 but HTML gets messed up when parsed

I'm having trouble scraping a website using BeautifulSoup4 and Python 3. I'm using dryscrape to get the HTML, since the site requires JavaScript to be enabled before it will show the page (although, as far as I know, the page itself never uses it).

This is my code:

from bs4 import BeautifulSoup
import dryscrape
productUrl = "https://www.mercadona.es/detall_producte.php?id=32009"
session = dryscrape.Session()
session.visit(productUrl)
response = session.body()
soup = BeautifulSoup(response, "lxml")
container1 = soup.find("div","contenido").find("dl").find_all("dt")
container3 = soup.find("div","contenido").find_all("td")

Now I want to read the contents of container3, but:

type(container3)

Returns:

bs4.element.ResultSet

which is the same as type(container1), but its length is 0!
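To illustrate why the type check alone says nothing here, below is a minimal sketch, assuming a small hand-written HTML snippet rather than the real page. The names sample_html, sample_soup, content_div, found and missing are made up for the example, and it uses the built-in "html.parser" instead of "lxml" only so that it runs without extra installs. As far as I can tell, find_all returns a ResultSet whether or not anything matched, so the length is the thing to look at.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption, not the real page):
# it contains <dt> tags but no <td> tags.
sample_html = '<div class="contenido"><dl><dt>Name</dt><dt>Price</dt></dl></div>'

# "html.parser" ships with Python; the code in the question uses "lxml".
sample_soup = BeautifulSoup(sample_html, "html.parser")
content_div = sample_soup.find("div", "contenido")

found = content_div.find_all("dt")    # two matching tags
missing = content_div.find_all("td")  # no matching tags

# Both results have the same type; only the length differs.
print(type(found).__name__, len(found))      # ResultSet 2
print(type(missing).__name__, len(missing))  # ResultSet 0
```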

I wanted to see what container3 held before searching for my tag, so I wrote it to a file.

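container3 = soup.find("div","contenido")
# Open the output file first (the file name here is only illustrative)
with open("soup_file.html", "w", encoding="utf-8") as soup_file:
    soup_file.write(container3.prettify())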

And, here is the link to that file: https://pastebin.com/xc22fefJ

The markup gets mangled just before the table I want to scrape. I can't understand why: when I view the page source in Firefox, everything looks fine.
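One way to narrow this down might be to check whether the <td> tags are present in the raw string at all, before BeautifulSoup touches it. The sketch below uses only the standard library. The helper count_open_tags and the raw_html sample are hypothetical, with raw_html standing in for whatever session.body() returns. It is a rough text count, not a real parse, so it only hints at whether the tags are lost while fetching or while parsing.

```python
import re


def count_open_tags(html: str, tag: str) -> int:
    """Count opening tags named `tag` in raw markup without parsing it."""
    # \b keeps "<td" from matching longer tag names that merely start with "td".
    pattern = re.compile(r"<" + re.escape(tag) + r"\b", re.IGNORECASE)
    return len(pattern.findall(html))


# Hypothetical stand-in for the string returned by session.body().
raw_html = "<div class='contenido'><table><tr><td>a</td><td>b</td></tr></table></div>"

print(count_open_tags(raw_html, "td"))  # 2
```

If the count on the raw string is greater than zero but find_all("td") on the parsed soup is empty, the loss would seem to happen during parsing. If it is already zero, the fetched HTML probably differs from what Firefox shows.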

Could you help me, please?

Thank you very much in advance.



