I'm trying to use Scrapy to log into a website, then navigate within that website, and eventually download data from it. Currently I'm stuck in the middle of the navigation part. Here are the things I looked into to solve the problem on my own.
- Datacamp course on Scrapy
- https://www.youtube.com/watch?v=G9Nni6G-iOc
- http://scrapingauthority.com/2016/11/22/scrapy-login/
- https://www.tutorialspoint.com/scrapy/scrapy_following_links.htm
- Relative url to absolute url scrapy
However, I do not seem to be able to connect the dots.
Below is the code I currently use. I manage to log in (when I call the "open_in_browser" function, I see that I'm logged in). I also manage to "click" on the first button on the website in the "parse2" part (if I call "open_in_browser" after parse2, I see that the navigation bar at the top of the website has gone one level deeper).
The main problem is now in the "parse3" part, as I cannot navigate another level deeper (or maybe I can, but "open_in_browser" does not open the website anymore; it only works if I put it after parse or parse2). My understanding is that I chain multiple parse functions one after another to navigate through the website. Datacamp says I always need to start with a "start_requests" function, which is what I tried, but in the YouTube videos etc. I saw that most spiders start directly with parse functions. Using "Inspect" on the website for parse3, I see that this time the href is a relative link, and I used different methods (see source 5) to navigate to it, as I thought this might be the source of the error. Right now I have run out of ideas about what it could be, which is why I ended up here :)
Thanks in advance for any help or tips.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser
from scrapy.crawler import CrawlerProcess


class LoginNeedScraper(scrapy.Spider):
    name = "login"
    start_urls = ["<some website>"]

    def parse(self, response):
        # Pull the hidden form tokens needed for the login POST
        loginTicket = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[1]/@value').extract_first()
        execution = response.xpath('/html/body/section/div/div/div/div[2]/form/div[3]/input[2]/@value').extract_first()
        return FormRequest.from_response(response, formdata={
            'loginTicket': loginTicket,
            'execution': execution,
            'username': '<someusername>',
            'password': '<somepassword>'},
            callback=self.parse2)

    def parse2(self, response):
        # Follow the first link in the top navigation bar
        next_page_url = response.xpath('/html/body/nav/div[2]/ul/li/a/@href').extract_first()
        yield scrapy.Request(url=next_page_url, callback=self.parse3)

    def parse3(self, response):
        # This href is relative, so resolve it against the current page URL
        next_page_url_2 = response.xpath('/html//div[@class = "headerPanel"]/div[3]/a/@href').extract_first()
        absolute_url = response.urljoin(next_page_url_2)
        yield scrapy.Request(url=absolute_url, callback=self.start_scraping)

    def start_scraping(self, response):
        open_in_browser(response)


process = CrawlerProcess()
process.crawl(LoginNeedScraper)
process.start()
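One way to sanity-check the relative-URL handling in parse3 outside of Scrapy: `response.urljoin(href)` delegates to the standard library's `urllib.parse.urljoin(response.url, href)`, so you can verify what absolute URL a given relative href would resolve to. The URLs below are made-up placeholders, not the real site:

```python
from urllib.parse import urljoin

# Stand-in for response.url on the page where parse3 runs (hypothetical)
base = "https://example.com/portal/accounts/"

# A relative href like the one extracted in parse3 (made-up value)
relative_href = "../reports/daily"

# response.urljoin(relative_href) resolves it the same way:
absolute = urljoin(base, relative_href)
print(absolute)  # https://example.com/portal/reports/daily
```

If the printed URL is not the page you expect, the extracted href (or the base page it is resolved against) is the problem, not the request itself.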