Friday, 30 December 2016

Python web scraping with BS4: HTML not downloading correctly

I've been trying to scrape information from the website http://ift.tt/2iyCLeM

The information I want is in the <a> elements with the class "RecogniaEventSummaryBodyLinks".

But when I downloaded the HTML and printed it, the download turned out to be incomplete. By that I mean: I copied the whole HTML text produced by my Python code into Notepad++ and used Ctrl+F to search for these elements, and they weren't there.

I also tried downloading the file manually from the website, but that didn't work either.

Here's my code (Python):

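import mechanize
import cookielib  # Python 2 module; on Python 3 it is http.cookiejar
from bs4 import BeautifulSoup

def viewPage(url, proxy, userAgent):
    # Fetch the page through a proxy with custom headers, keeping cookies.
    br = mechanize.Browser()
    cookieJar = cookielib.LWPCookieJar()
    br.set_cookiejar(cookieJar)
    br.set_proxies(proxy)
    br.addheaders = userAgent
    page = br.open(url)
    htmlFile = page.read()
    # Print every cookie the server set, for debugging.
    for cookie in cookieJar:
        print("cookie:  " + str(cookie))
        print("")
    return htmlFile

def ScrapeFigures(url):
    # proxyAdress and agentStringSample are globals (not shown here).
    html = viewPage(url, proxyAdress, agentStringSample)
    soup = BeautifulSoup(html, "html.parser")
    # Look for the first <a> element carrying the target class.
    info = soup.find("a", attrs={"class": "RecogniaEventSummaryBodyLinks"})
    return info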

I tried printing the variable info, but it was None.

However, I then copied the Python output for the whole soup variable from the code above into a text file and saved it as an HTML file. When I opened that file in my web browser (Chrome), the elements I needed were on the page, despite not being present in the HTML text. So I wondered: is this caused by some JavaScript that runs in the background when the page is opened?
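One way to probe that hypothesis is to check where the class name occurs in the downloaded text: on real tags, or only inside <script> bodies. The sketch below is a rough diagnostic, not a fix. It uses only the Python 3 standard library (html.parser), so it is independent of the mechanize code above; ClassLocator and locate_class are hypothetical names introduced for illustration, and the sample HTML strings are made up, not taken from the site.

```python
from html.parser import HTMLParser


class ClassLocator(HTMLParser):
    """Count where a class name appears: on real tags vs. inside <script> text."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.tag_hits = 0      # elements whose class attribute contains the name
        self.script_hits = 0   # mentions of the name inside <script> bodies
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        # A valueless class attribute yields None, hence the "or".
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.tag_hits += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.script_hits += data.count(self.class_name)


def locate_class(html, class_name):
    """Return (tag_hits, script_hits) for class_name in the given HTML text."""
    parser = ClassLocator(class_name)
    parser.feed(html)
    parser.close()
    return parser.tag_hits, parser.script_hits


# Made-up sample pages, for illustration only.
static_page = '<body><a class="RecogniaEventSummaryBodyLinks" href="#">x</a></body>'
scripted_page = ('<body><script>el.className = '
                 '"RecogniaEventSummaryBodyLinks";</script></body>')

print(locate_class(static_page, "RecogniaEventSummaryBodyLinks"))
print(locate_class(scripted_page, "RecogniaEventSummaryBodyLinks"))
```

If the first number is zero and the second is not, the elements are probably built by a script in the page. If both are zero, as the Ctrl+F search here suggests, the markup is likely fetched by a script from a different URL, which a plain HTTP download would never see.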

My question is: how can I scrape the elements described above? Is there a way around this behaviour?

Thank you for your time.



