Sunday, December 29, 2019

Python - Scrapy - Navigating through a website

I'm trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I'm stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own:

  1. Datacamp course on Scrapy
  2. https://www.youtube.com/watch?v=G9Nni6G-iOc
  3. http://scrapingauthority.com/2016/11/22/scrapy-login/
  4. https://www.tutorialspoint.com/scrapy/scrapy_following_links.htm
  5. Relative url to absolute url scrapy

However, I can't seem to connect the dots.

Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I can see that I'm logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
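
As an aside, my understanding from the Scrapy docs is that FormRequest.from_response already pre-fills every field it finds in the page's <form> (including hidden inputs such as loginTicket and execution), so extracting them by hand may not be needed. A minimal login sketch under that assumption (the field names and URLs are placeholders from my own site):

import scrapy
from scrapy.http import FormRequest


class MinimalLoginSpider(scrapy.Spider):
    name = "minimal_login"
    start_urls = ["<some website>"]

    def parse(self, response):
        # from_response reads the form's existing fields from the HTML,
        # so hidden CSRF-style tokens do not need to be extracted by hand;
        # only the credentials are overridden here.
        return FormRequest.from_response(
            response,
            formdata={"username": "<someusername>",
                      "password": "<somepassword>"},
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info("Logged in, landed on %s", response.url)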

The main problem is now in the "parse3" part, as I cannot navigate another level deeper (or maybe I can, but "open_in_browser" no longer opens the website; it only works if I put it after parse or parse2). My understanding is that I chain multiple "parse functions" one after another to navigate through the website. Datacamp says I always need to start with a "start_requests" function, which is what I tried, but in the YouTube videos etc. I saw evidence that most spiders start directly with parse functions. Using "inspect" on the website for parse3, I see that this time the href is a relative link, and I used different methods (see source 5) to convert it to an absolute URL, as I thought this might be the source of the error. Right now, I have run out of ideas what it could be, which is why I ended up here :)
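
For what it's worth, one thing I found while researching source 5 is that response.follow (available since Scrapy 1.4) accepts relative URLs directly, so the urljoin step can be dropped entirely. A sketch of the two navigation callbacks using it (the XPaths are the placeholders from my own code):

def parse2(self, response):
    # response.follow resolves the href against response.url, so it
    # works whether the extracted link is relative or absolute.
    next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').get()
    if next_page_url:
        yield response.follow(next_page_url, callback=self.parse3)

def parse3(self, response):
    next_page_url_2 = response.xpath('//div[@class="headerPanel"]/div[3]/a/@href').get()
    if next_page_url_2:
        yield response.follow(next_page_url_2, callback=self.start_scraping)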

Thanks in advance for any help or tips.

import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess


class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        # Extract the hidden CSRF-style tokens from the login form.
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        # Submit the login form, then continue in parse2.
        return FormRequest.from_response(response, formdata={
                                                   'loginTicket': loginTicket,
                                                   'execution': execution,
                                                   'username': '<someusername>',
                                                   'password': '<somepassword>'},
                                         callback=self.parse2)

    def parse2(self, response):
        # "Click" the first button: extract its href and follow it.
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        # urljoin makes the request valid even if this href turns out
        # to be relative rather than absolute.
        yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse3)

    def parse3(self, response):
        # Go one level deeper; this href is relative, so convert it
        # to an absolute URL before requesting it.
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        # Open the final page in a browser to verify the navigation worked.
        open_in_browser(response)


process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
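
To see why the parse3 request might silently go nowhere (for example, if it is filtered as a duplicate or redirected back to the login page), it can help to raise the log verbosity when running the crawl. CrawlerProcess accepts a settings dict; a sketch of that:

from scrapy.crawler import CrawlerProcess

# DEBUG-level logs show each request Scrapy schedules, filters out as a
# duplicate, or redirects, which helps locate where the chain breaks.
process = CrawlerProcess(settings={"LOG_LEVEL": "DEBUG"})
process.crawl(LoginNeedScraper)
process.start()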


