I am trying to scrape booking.com with Scrapy. The problem occurs when I try to implement pagination. I'm trying to get the URL of the next page, but Scrapy returns a different URL (I get it through the shell), which results in "page not found" when I paste it into Chrome. And when I export the results to JSON, no pagination URL is retrieved at all. Does anyone have any suggestions? Maybe I should shorten the first URL.
I tried to set a "canonicalize=False" rule, but it didn't do anything.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BookingSpider(scrapy.Spider):
    name = "BookingScrape"
    start_urls = [
        'https://www.booking.com/searchresults.en-gb.html?aid=304142&label=gen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB&lang=en-gb&sid=163b31478fa340d233204d1dcbb259ec&sb=1&src=searchresults&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fsearchresults.en-gb.html%3Faid%3D304142%3Blabel%3Dgen173nr-1FCAEoggI46AdIM1gEaFCIAQGYAQm4ARfIAQzYAQHoAQH4AQuIAgGoAgO4Aomk4OkFwAIB%3Bsid%3D163b31478fa340d233204d1dcbb259ec%3Btmpl%3Dsearchresults%3Bcheckin_month%3D9%3Bcheckin_monthday%3D10%3Bcheckin_year%3D2019%3Bcheckout_month%3D9%3Bcheckout_monthday%3D12%3Bcheckout_year%3D2019%3Bclass_interval%3D1%3Bdest_id%3D15754%3Bdest_type%3Dlandmark%3Bdtdisc%3D0%3Bfrom_sf%3D1%3Bgroup_adults%3D2%3Bgroup_children%3D0%3Binac%3D0%3Bindex_postcard%3D0%3Blabel_click%3Dundef%3Blandmark%3D15754%3Bno_rooms%3D1%3Boffset%3D0%3Bpostcard%3D0%3Broom1%3DA%252CA%3Bsb_price_type%3Dtotal%3Bshw_aparth%3D1%3Bslp_r_match%3D0%3Bsrc%3Dsearchresults%3Bsrc_elem%3Dsb%3Bsrpvid%3Da3bf35ea467d01b9%3Bss%3DKensington%2520High%2520Street%3Bss_all%3D0%3Bssb%3Dempty%3Bsshis%3D0%3Bssne%3DKensington%2520High%2520Street%3Bssne_untouched%3DKensington%2520High%2520Street%26%3B&ss=Kensington+High+Street&is_ski_area=0&ssne=Kensington+High+Street&ssne_untouched=Kensington+High+Street&landmark=15754&checkin_year=2019&checkin_month=9&checkin_monthday=10&checkout_year=2019&checkout_month=9&checkout_monthday=12&group_adults=2&group_children=0&no_rooms=1&from_sf=1',
    ]

    rules = (
        Rule(LinkExtractor(allow=('CINE&OBRA&-1&29',), canonicalize=False), callback='parse_item', follow=False),
    )

    def parse(self, response):
        for hotel in response.css("h3.sr-hotel__title"):
            yield {
                'hotel_name': hotel.css("span.sr-hotel__name::text").extract_first(),
                'link': hotel.css("h3.sr-hotel__title a::attr(href)").extract_first(),
                'pagination' : hotel.css('li.bui-pagination__item bui-pagination__next-arrow a::attr(href)').extract_first()
            }

        for a in response.css('li.bui-pagination__item.bui-pagination__next-arrow a'):
            yield response.follow(a, callback=self.parse)
URL received through the shell, which doesn't take me to the next page: https://www.booking.com/searchresults.en-gb.html" data-page-next class="bui-pagination__link paging-next ga_sr_gotopage_2_207" title="Next page">\n
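
As a point of comparison for the pagination step, here is a minimal standard-library sketch (not Scrapy itself) of pulling the next-page href out of an li that carries both bui-pagination__item and bui-pagination__next-arrow, then resolving it against the page URL. In a CSS selector the two class names have to be chained with a dot (li.bui-pagination__item.bui-pagination__next-arrow); written with a space between them, the second name is read as a descendant element name rather than a class. SAMPLE_HTML, the offset=15 link, NextLinkParser and next_page_url are made up for illustration, and the real booking.com markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup modelled on the class names used in the spider;
# the real booking.com page may look different.
SAMPLE_HTML = """
<ul>
  <li class="bui-pagination__item bui-pagination__next-arrow">
    <a href="/searchresults.en-gb.html?offset=15" title="Next page">Next</a>
  </li>
</ul>
"""

# An <li> only counts as the "next" item if it carries BOTH classes.
REQUIRED_CLASSES = {"bui-pagination__item", "bui-pagination__next-arrow"}


class NextLinkParser(HTMLParser):
    """Record the href of the first <a> nested in an <li> with both classes."""

    def __init__(self):
        super().__init__()
        self.li_stack = []    # one bool per open <li>: does it carry both classes?
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            classes = set((attrs.get("class") or "").split())
            self.li_stack.append(REQUIRED_CLASSES <= classes)
        elif tag == "a" and any(self.li_stack) and self.next_href is None:
            self.next_href = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li" and self.li_stack:
            self.li_stack.pop()


def next_page_url(page_url, html):
    """Return the absolute next-page URL, or None if no matching link exists."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.next_href is None:
        return None
    # A relative href has to be joined with the URL of the page it came from.
    return urljoin(page_url, parser.next_href)


if __name__ == "__main__":
    print(next_page_url("https://www.booking.com/searchresults.en-gb.html", SAMPLE_HTML))
```

In the spider, the equivalent lookup would presumably run against response rather than against each hotel element, since the pagination list is unlikely to sit inside h3.sr-hotel__title. Note also that rules are only honoured by CrawlSpider; a plain scrapy.Spider ignores them.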