Monday, 18 May 2015

Web scraping with Python and Beautiful Soup: page number in the request doesn't change for a specific website

I am trying to extract some information from a specific page using Python and Beautiful Soup. I found that requests.get(url) returns the same page even when I request different page numbers. The lines below print which page I am scraping, and they always report page number 1. Even when I start with page number 2, the scraper begins at the first page and scrapes only the first page.

    activepage = soup.find('ul', id= 'pagination').li.string
    print "Page Number: " + activepage

I tested my code on other pages and it works fine, but on this specific page I can't loop through the different pages. Can anyone tell me what exactly the problem with this page is and how to solve it?

    import requests
    import sys
    from bs4 import BeautifulSoup

    def trade_spider(max_pages):

        page_number = 1

        while page_number <= max_pages:
            url = "http://ift.tt/1JTH2jX" + str(page_number) + "&category=festivals_parades"
            source_code = requests.get(url)
            # just get the page body as text, without the response headers
            plain_text = source_code.text
            # BeautifulSoup objects can be searched through easily
            soup = BeautifulSoup(plain_text)
            category = soup.find('li', id = 'breadcrumb-label').string
            activepage = soup.find('ul', id= 'pagination').li.string
            print "Page Number: " + activepage
            for mylist in soup.findAll('li', {'class': 'clearfix'}):
                link = mylist.find('a', {'data-ga-label': 'Event Title'})
                if (link is not None):
                    href = link.get('href')
                    title = link.string  # just the text, not the HTML
                    location = mylist.find("div", {"class": "event-meta"}).strong.string
                    date = mylist.find("div", {"class": "event-meta"}).span.string
                    print(title, category, href, date, location)

            page_number += 1

    trade_spider(8)
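
One way to narrow this down is to check which URL the server actually answers from. A possible cause, which the question does not confirm, is that the address being requested redirects to a fixed URL, so the page number appended to it never reaches the site. The following is a minimal Python 3 sketch for checking that, not a confirmed fix. The base URL `http://example.com/events` and the query parameter names `page` and `category` are placeholders, and `build_page_url` and `redirect_chain` are hypothetical helper names. The `.url` and `.history` attributes are the ones exposed by a `requests` response object.

```python
from urllib.parse import urlencode


def build_page_url(base_url, page_number, category):
    """Return base_url with an explicit, properly encoded query string.

    The parameter names "page" and "category" are assumptions; replace them
    with whatever names the real site uses.
    """
    query = urlencode({"page": page_number, "category": category})
    return base_url + "?" + query


def redirect_chain(response):
    """Return every URL visited for one request, ending with the final one.

    `response` is expected to behave like a requests.Response: `.history`
    holds the intermediate redirect responses and `.url` is the final URL.
    """
    return [hop.url for hop in response.history] + [response.url]
```

With `requests` installed, calling `redirect_chain(requests.get(url))` for page 1 and page 2 and comparing the last entry of each list shows whether both requests end up at the same address. If they do, the page number is being lost before it reaches the site, and the loop would need to build its URLs from the final, unshortened address.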



