web: Python Webscraping with BS4 and html not downloading correctly

Friday, 30 December 2016

I've been trying to scrape information from this website: http://ift.tt/2iyCLeM

The information I want is in the <a> elements with the class "RecogniaEventSummaryBodyLinks".

But when I downloaded the HTML and printed it, the download appeared to be incomplete. What I mean is that when I copied the whole HTML text returned by my Python code into Notepad++ and used CTRL+F to search for these elements, they weren't there.
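The CTRL+F check described above can also be done in code. The following is a minimal sketch, assuming the downloaded page is already available as a string; `find_class_mentions` and `sample_html` are hypothetical names and data made up for illustration, not part of the original script.

```python
def find_class_mentions(html_text, class_name):
    """Return (line_number, stripped_line) for every line that mentions class_name."""
    matches = []
    for number, line in enumerate(html_text.splitlines(), start=1):
        if class_name in line:
            matches.append((number, line.strip()))
    return matches


# Hypothetical stand-in for the downloaded page: the target class never
# appears in the text, only an empty container and a script reference.
sample_html = """<html>
<body>
<div id="events"></div>
<script src="/js/events.js"></script>
</body>
</html>"""

mentions = find_class_mentions(sample_html, "RecogniaEventSummaryBodyLinks")
if mentions:
    for number, line in mentions:
        print("%d: %s" % (number, line))
else:
    print("class name not found in the downloaded text")
```

In the real script, the string returned by `viewPage` would be passed in place of `sample_html`. An empty result matches what the manual search in Notepad++ showed.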

I also tried downloading the file manually, directly from the website, but that didn't work either.

Here's my code (Python):

import mechanize
import cookielib
from bs4 import BeautifulSoup

def viewPage(url,proxy,userAgent):
    br = mechanize.Browser()
    cookieJar = cookielib.LWPCookieJar()
    br.set_cookiejar(cookieJar)
    br.set_proxies(proxy)
    br.addheaders = userAgent
    page = br.open(url)
    htmlFile = page.read()
    for cookie in cookieJar:
        print("cookie:  " + str(cookie))
        print("")
    return htmlFile

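def ScrapeFigures(url):
    # proxyAdress and agentStringSample are not shown here
    html = viewPage(url,proxyAdress,agentStringSample)
    soup = BeautifulSoup(html,"html.parser")
    # Look for the first <a> element with the target class
    info = soup.find("a",attrs={"class":"RecogniaEventSummaryBodyLinks"})
    print(info)
    return info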

I tried printing the variable info, but it printed None.

However, I then copied the printed output of the whole soup variable into another text file and saved it as an HTML file. When I opened that file in my web browser (Chrome), the elements I needed appeared on the page, even though they are not present in the file's HTML text. So I wonder: is this caused by some JavaScript in the background that runs when the page is opened?
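One way to probe the JavaScript suspicion is to list the scripts the downloaded HTML references and count how many tags actually carry the target class. This is only a rough sketch using the standard-library HTML parser rather than BeautifulSoup; the class `ScriptAndClassCounter`, the sample markup and the path `/js/events.js` are hypothetical and not taken from the real page.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class ScriptAndClassCounter(HTMLParser):
    """Collect script references and count tags that carry a given class."""

    def __init__(self, class_name):
        HTMLParser.__init__(self)
        self.class_name = class_name
        self.script_sources = []   # values of <script src="...">
        self.inline_scripts = 0    # <script> tags without a src attribute
        self.class_hits = 0        # tags whose class list contains class_name

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "script":
            if attributes.get("src"):
                self.script_sources.append(attributes["src"])
            else:
                self.inline_scripts += 1
        classes = (attributes.get("class") or "").split()
        if self.class_name in classes:
            self.class_hits += 1


# Hypothetical stand-in for the downloaded page
sample_html = """<html><body>
<div id="events"></div>
<script src="/js/events.js"></script>
<script>loadEvents("events");</script>
</body></html>"""

counter = ScriptAndClassCounter("RecogniaEventSummaryBodyLinks")
counter.feed(sample_html)
print("external scripts:", counter.script_sources)
print("inline scripts:", counter.inline_scripts)
print("tags with target class:", counter.class_hits)
```

If the real page shows zero tags with the target class but several scripts, that would be consistent with the elements being built in the browser after the page loads, although it does not prove it.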

My question is: how can I scrape the elements described above? Is there a way to work around this?

Thank you for your time
