I've been trying to scrape information from this website: http://ift.tt/2iyCLeM
The information I want is in the elements with the class "RecogniaEventSummaryBodyLinks".
But when I downloaded the HTML and printed it, the page didn't seem to have downloaded correctly. I pasted the full HTML text returned by my Python code into Notepad++ and searched (Ctrl+F) for these elements, and they weren't there.
I also tried downloading the file manually from the website, but that didn't work either.
Here's my code (Python):
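import mechanize
import cookielib
from bs4 import BeautifulSoup

def viewPage(url, proxy, userAgent):
    # Open the URL through a proxy with a custom user agent, keeping cookies
    br = mechanize.Browser()
    cookieJar = cookielib.LWPCookieJar()
    br.set_cookiejar(cookieJar)
    br.set_proxies(proxy)
    br.addheaders = userAgent
    page = br.open(url)
    htmlFile = page.read()
    # Show the cookies the server set during the request
    for cookie in cookieJar:
        print("cookie: " + str(cookie))
    print("")
    return htmlFile

def ScrapeFigures(url):
    # proxyAdress and agentStringSample are assumed to be defined elsewhere in the script
    html = viewPage(url, proxyAdress, agentStringSample)
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first matching <a> tag, or None if there is no match
    info = soup.find("a", attrs={"class": "RecogniaEventSummaryBodyLinks"})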
I tried printing the variable info, but it was None.
However, I then copied the Python output for the whole soup variable from the code above into a text file and saved it as an HTML file. When I opened that file in my web browser (Chrome), the elements I needed were on the page, even though they were not present in the HTML text itself. So I wondered: is this caused by some JavaScript running in the background that is triggered when the page is opened?
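One way to test that guess is to check the raw markup directly: count how many tags carry the class, and list the scripts the page loads. The following is only a minimal sketch, assuming Python 3 and the standard library's html.parser rather than BeautifulSoup. The names StaticMarkupProbe and probe_static_html are hypothetical, and sample_html with its script names is a made-up stand-in for the real downloaded page.

```python
from html.parser import HTMLParser


class StaticMarkupProbe(HTMLParser):
    """Counts tags with a given class and records <script> tags in raw HTML."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.matches = 0          # tags carrying target_class in the raw markup
        self.script_sources = []  # src value of each external <script>
        self.inline_scripts = 0   # <script> tags without a src attribute

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A class attribute may hold several space-separated class names
        classes = (attributes.get("class") or "").split()
        if self.target_class in classes:
            self.matches += 1
        if tag == "script":
            src = attributes.get("src")
            if src:
                self.script_sources.append(src)
            else:
                self.inline_scripts += 1


def probe_static_html(html, target_class):
    """Parse html without executing any script and return the filled probe."""
    probe = StaticMarkupProbe(target_class)
    probe.feed(html)
    probe.close()
    return probe


# Hypothetical stand-in for the HTML returned by viewPage()
sample_html = """
<html><body>
<div id="events"></div>
<script src="/js/events.js"></script>
<script>loadEvents();</script>
</body></html>
"""

result = probe_static_html(sample_html, "RecogniaEventSummaryBodyLinks")
print("matches in static markup:", result.matches)
print("external scripts:", result.script_sources)
print("inline scripts:", result.inline_scripts)
```

If the match count is zero while the page does load scripts, that is consistent with the elements being inserted by JavaScript after the page loads, which a plain HTTP download never executes. It does not prove it, though: the server could also be returning different markup depending on the proxy, cookies, or user agent.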
My question is: how can I scrape the elements described above? Is there a way to get around this weird bug?
Thank you for your time.