I am trying to scrape booking.com with Scrapy. The problem occurs when I try to implement pagination. I'm trying to get the URL of the next page, but Scrapy gives me a different URL (I get it through the shell), which results in "page not found" when I paste it into Chrome. And when I write the results to JSON, no pagination URL is retrieved at all. Does anyone have any suggestions? Maybe I should shorten the first URL.
I tried passing "canonicalize=False" to the rule, but it didn't do anything.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookingSpider(scrapy.Spider):
    name = "BookingScrape"
    start_urls = [
        # start URL missing from the post
    ]
    rules = (
        Rule(LinkExtractor(allow=('CINE&OBRA&-1&29',), canonicalize=False), callback='parse_item', follow=False),
    )

    def parse(self, response):
        for hotel in response.css("h3.sr-hotel__title"):
            yield {
                'hotel_name': hotel.css("span.sr-hotel__name::text").extract_first(),
                'link': hotel.css("h3.sr-hotel__title a::attr(href)").extract_first(),
                'pagination': hotel.css('li.bui-pagination__item bui-pagination__next-arrow a::attr(href)').extract_first()
            }
        for a in response.css('li.bui-pagination__item.bui-pagination__next-arrow a'):
            yield response.follow(a, callback=self.parse)
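Two things in the spider may explain the empty 'pagination' field. The selector is run on each hotel title (the h3), which is unlikely to contain the pagination list, and the space between the two class names makes CSS look for an element named bui-pagination__next-arrow nested inside the li, instead of an li that carries both classes. Separately, "rules" is only processed by CrawlSpider, so on a plain scrapy.Spider the Rule (and its canonicalize=False) has no effect. The sketch below illustrates the scoping and class matching with the standard library only; the markup and the "?offset=15" href are made up for illustration and are not the real booking.com page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (assumption: the real page is far larger).
PAGE = """
<div>
  <h3 class="sr-hotel__title">
    <a href="/hotel/a.html"><span class="sr-hotel__name">Hotel A</span></a>
  </h3>
  <ul>
    <li class="bui-pagination__item bui-pagination__next-arrow">
      <a href="/searchresults.en-gb.html?offset=15">Next page</a>
    </li>
  </ul>
</div>
"""


def has_classes(element, *names):
    """Return True if every name is one of the element's space-separated classes."""
    return set(names) <= set(element.get("class", "").split())


root = ET.fromstring(PAGE)

# Searching inside a hotel title, as the 'pagination' field does, finds no <li>.
hotel = next(e for e in root.iter("h3") if has_classes(e, "sr-hotel__title"))
inside_hotel = list(hotel.iter("li"))

# Searching the whole page for an <li> that carries BOTH classes finds the link.
next_links = [
    li.find("a").get("href")
    for li in root.iter("li")
    if has_classes(li, "bui-pagination__item", "bui-pagination__next-arrow")
]

print(inside_hotel)
print(next_links)
```

In Scrapy terms this would correspond to calling response.css (not hotel.css) with the two classes joined by a dot, as the second loop in parse already does.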
URL received through the shell, which doesn't take me to the next page: https://www.booking.com/searchresults.en-gb.html" data-page-next class="bui-pagination__link paging-next ga_sr_gotopage_2_207" title="Next page">\n
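The fragment above ends with tag attributes (data-page-next, class, title), which suggests that the whole a element was extracted instead of only its href value. A minimal sketch of pulling out just the href and resolving it against the page URL, using only the standard library, could look like the following. The tag is adapted from the shell output and its "?offset=15" query string is a placeholder, since the real href is cut off in the post.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Adapted from the shell output; the query string is a hypothetical placeholder.
TAG = (
    '<a href="/searchresults.en-gb.html?offset=15" data-page-next '
    'class="bui-pagination__link paging-next" title="Next page">\n'
)
PAGE_URL = "https://www.booking.com/searchresults.en-gb.html"

collector = HrefCollector()
collector.feed(TAG)

# Keep only the attribute value, then make it absolute relative to the page URL.
next_url = urljoin(PAGE_URL, collector.hrefs[0])
print(next_url)
```

In the Scrapy shell, selecting a::attr(href) returns only the attribute value, and response.follow accepts a relative URL or an a selector and resolves it against the response URL.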