Thursday, 25 July 2019

Python Scrapy returns a different URL

I am trying to scrape booking.com with Scrapy. The problem occurs when I try to implement pagination. I'm trying to get the URL of the next page, but Scrapy gives me a different URL (I got it through the shell), which results in "page not found" when I paste it into Chrome. And when I export the results to JSON, no pagination URL is retrieved at all. Does anyone have any suggestions? Maybe I should shorten the first URL.

I tried setting "canonicalize=False" in a rule, but it didn't change anything.
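On the idea of shortening the first URL, here is a minimal sketch using only the standard library. It drops every query parameter that is not on a keep-list. The function name `shorten_search_url` and the `KEEP` set are hypothetical, and which parameters booking.com actually requires is an assumption I have not verified.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

# Hypothetical keep-list: a guess at the search-related parameters,
# not a confirmed list of what booking.com needs.
KEEP = {
    "ss", "landmark",
    "checkin_year", "checkin_month", "checkin_monthday",
    "checkout_year", "checkout_month", "checkout_monthday",
    "group_adults", "group_children", "no_rooms",
}


def shorten_search_url(url, keep=KEEP):
    """Return url with only the query parameters named in keep."""
    parts = urlsplit(url)
    # parse_qsl decodes each pair; urlencode re-encodes the ones we keep.
    pairs = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key in keep
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment)
    )
```

If the shortened URL still returns the same results in a browser, it could replace the long entry in start_urls; if not, parameters would have to be added back to the keep-list one at a time.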

# -*- coding: utf-8 -*-

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BookingSpider(scrapy.Spider):

    name = "BookingScrape"
    start_urls = [
        'https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB&lang=en-gb&sid=163b31478fa340d233204d1dcbb259ec&sb=1&src=searchresults&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fsearchresults.en-gb.html%3Faid%3D304142%3Blabel%3Dgen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB%3Bsid%3D163b31478fa340d233204d1dcbb259ec%3Btmpl%3Dsearchresults%3Bcheckin_month%3D9%3Bcheckin_monthday%3D10%3Bcheckin_year%3D2019%3Bcheckout_month%3D9%3Bcheckout_monthday%3D12%3Bcheckout_year%3D2019%3Bclass_interval%3D1%3Bdest_id%3D15754%3Bdest_type%3Dlandmark%3Bdtdisc%3D0%3Bfrom_sf%3D1%3Bgroup_adults%3D2%3Bgroup_children%3D0%3Binac%3D0%3Bindex_postcard%3D0%3Blabel_click%3Dundef%3Blandmark%3D15754%3Bno_rooms%3D1%3Boffset%3D0%3Bpostcard%3D0%3Broom1%3DA%252CA%3Bsb_price_type%3Dtotal%3Bshw_aparth%3D1%3Bslp_r_match%3D0%3Bsrc%3Dsearchresults%3Bsrc_elem%3Dsb%3Bsrpvid%3Da3bf35ea467d01b9%3Bss%3DKensington%2520High%2520Street%3Bss_all%3D0%3Bssb%3Dempty%3Bsshis%3D0%3Bssne%3DKensington%2520High%2520Street%3Bssne_untouched%3DKensington%2520High%2520Street%26%3B&ss=Kensington+High+Street&is_ski_area=0&ssne=Kensington+High+Street&ssne_untouched=Kensington+High+Street&landmark=15754&checkin_year=2019&checkin_month=9&checkin_monthday=10&checkout_year=2019&checkout_month=9&checkout_monthday=12&group_adults=2&group_children=0&no_rooms=1&from_sf=1',
    ]

    rules = (
        Rule(LinkExtractor(allow=('CINE&OBRA&-1&29',), canonicalize=False), callback='parse_item', follow=False),
    )

    def parse(self, response):
        for hotel in response.css("h3.sr-hotel__title"):
            yield {
                'hotel_name': hotel.css("span.sr-hotel__name::text").extract_first(),
                'link': hotel.css("h3.sr-hotel__title a::attr(href)").extract_first(),
                'pagination': hotel.css('li.bui-pagination__item bui-pagination__next-arrow a::attr(href)').extract_first()
            }

        for a in response.css('li.bui-pagination__item.bui-pagination__next-arrow a'):
            yield response.follow(a, callback=self.parse)

URL received through the shell, which doesn't take me to the next page: https://www.booking.com/searchresults.en-gb.html" data-page-next class="bui-pagination__link paging-next ga_sr_gotopage_2_207" title="Next page">\n

Expected URL: https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB&sid=163b31478fa340d233204d1dcbb259ec&tmpl=searchresults&checkin_month=9&checkin_monthday=10&checkin_year=2019&checkout_month=9&checkout_monthday=12&checkout_year=2019&class_interval=1&dest_id=15754&dest_type=landmark&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&landmark=15754&no_rooms=1&postcard=0&room1=A%2CA&sb_price_type=total&shw_aparth=1&slp_r_match=0&src=searchresults&src_elem=sb&srpvid=7246436ad3b000a5&ss=Kensington%20High%20Street&ss_all=0&ssb=empty&sshis=0&ssne=Kensington%20High%20Street&ssne_untouched=Kensington%20High%20Street&rows=15&offset=15
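Looking at the expected URL, it ends in rows=15&offset=15, so one possible workaround is to build the next-page URL directly instead of reading it from the page. The sketch below assumes that pagination is driven only by those two parameters, which is a guess from this single URL and not something confirmed by booking.com. The function name `next_page_url` is hypothetical.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit


def next_page_url(url, rows=15):
    """Return url with its offset advanced by rows.

    Assumption: the site pages through results with 'rows' and 'offset'
    query parameters, as the expected URL above suggests.
    """
    parts = urlsplit(url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    # A missing offset is treated as the first page (offset 0).
    offset = int(params.get("offset", 0))
    params["rows"] = str(rows)
    params["offset"] = str(offset + rows)
    return urlunsplit(parts._replace(query=urlencode(params)))
```

Inside parse, something like yield scrapy.Request(next_page_url(response.url), callback=self.parse) could then follow the next page, with a stop condition (for example, no hotels found on the page) so the spider does not loop forever.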



