I am trying to extract some information from a specific page using Python and Beautiful Soup. I found that requests.get(url) doesn't change the page even when I request different pages. In the line below I read which page I am scraping; it always returns pagenumber=1, and even when I start with pagenumber=2 it starts from the first page and scrapes just the first page.
activepage = soup.find('ul', id='pagination').li.string
print("Page Number: " + activepage)
I tested my code on other pages and it works fine, but on this specific page I can't loop through the different pages. Can anyone tell me exactly what the problem with this page is and how to solve it?
import requests
from bs4 import BeautifulSoup


def trade_spider(max_pages):
    page_number = 1
    while page_number <= max_pages:
        url = "http://ift.tt/1JTH2jX" + str(page_number) + "&category=festivals_parades"
        source_code = requests.get(url)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be sorted through easily
        soup = BeautifulSoup(plain_text, "html.parser")
        category = soup.find('li', id='breadcrumb-label').string
        activepage = soup.find('ul', id='pagination').li.string
        print("Page Number: " + activepage)
        for mylist in soup.find_all('li', {'class': 'clearfix'}):
            link = mylist.find('a', {'data-ga-label': 'Event Title'})
            if link is not None:
                href = link.get('href')
                title = link.string  # just the text, not the HTML
                location = mylist.find("div", {"class": "event-meta"}).strong.string
                date = mylist.find("div", {"class": "event-meta"}).span.string
                print(title, category, href, date, location)
        # move on to the next page only after the current one is processed
        page_number += 1


trade_spider(8)
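One way to narrow this down might be to compare the URL that was requested with the URL the response actually came from. This is only a diagnostic sketch, not a fix: describe_fetch is a hypothetical helper name, and it assumes the object passed in exposes .url and .history the way a requests Response does. The example.com URLs and the page=2 parameter are placeholders for illustration, not the real site. A stand-in object is used so the snippet runs without network access.

```python
from types import SimpleNamespace


def describe_fetch(response, requested_url):
    """Summarise where a request ended up (hypothetical helper).

    `response` is expected to look like a requests Response:
    .url is the final URL, .history is the list of redirect responses.
    """
    hops = [r.url for r in response.history]
    return {
        "requested": requested_url,
        "final": response.url,
        # True when the final URL is not the one that was asked for
        "redirected": response.url != requested_url,
        "hops": hops,
    }


# Stand-in for a response that was redirected and lost its query string.
# The URLs here are placeholders, not the real site.
fake_hop = SimpleNamespace(url="http://example.com/events?page=2")
fake_response = SimpleNamespace(url="http://example.com/events", history=[fake_hop])

info = describe_fetch(fake_response, "http://example.com/events?page=2")
print(info["redirected"], info["final"])
```

With the real code, the call would be describe_fetch(source_code, url) inside the loop. If "final" is the same for every page_number, the page parameter is probably being dropped before the page is served, which would explain why the first page always comes back. Note that the final URL can also differ from the requested one for harmless reasons such as normalisation, so a mismatch is a hint rather than proof.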