Monday, August 5, 2019

Scrapy request doesn't yield full HTML - but Requests library does

I am building a CrawlSpider to scrape statutory law data from the following website (https://www.azleg.gov/arsDetail/?title=1). All of my Scrapy requests yield incomplete HTML responses, so my XPath queries return nothing. However, when I use the Requests library, the HTML downloads correctly.

Using an online XPath tester, I've verified that my XPath queries should match the desired content. Using scrapy shell, I've viewed Scrapy's response object in my browser, and it looks just like the page does when I browse to it natively. I've also tried enabling middleware for both BeautifulSoup and Selenium, but neither appears to have worked.

Here's my crawl spider:

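from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class AZspider(CrawlSpider):
    name = "arizona"
    start_urls = [
        "https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm",
    ]

    # CrawlSpider reads the attribute named `rules` (plural); `rule` is ignored.
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class = 'article']"),
             callback="parse_stats_az", follow=True),
    )

    def parse_stats_az(self, response):
        # Extract the <p> markup as strings so the item can be serialized.
        statutes = response.xpath("//p").getall()
        yield {"statutes": statutes}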

And here's the code that successfully generated the correct response object:

import requests

az_leg = requests.get("https://www.azleg.gov/viewdocument/?docName=https://www.azleg.gov/ars/1/00101.htm")
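
To narrow down where the two downloads differ, the bodies could be compared by counting the tags the XPath query targets. This is only a rough sketch using the standard library: `TagCounter`, `count_tags`, and the two sample strings are hypothetical stand-ins, not part of my spider or the real page content. In practice the inputs would be `response.text` from Scrapy and `az_leg.text` from Requests.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count opening tags by name in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(html, tag):
    """Return how many opening `tag` elements appear in `html`."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts.get(tag.lower(), 0)


# Hypothetical stand-ins for response.text (Scrapy) and az_leg.text (Requests).
scrapy_body = "<html><body><div class='article'></div></body></html>"
requests_body = "<html><body><p>1-101.</p><P>Designation</P></body></html>"

print("Scrapy <p> count:", count_tags(scrapy_body, "p"))
print("Requests <p> count:", count_tags(requests_body, "p"))
```

If the counts match on the real bodies, the HTML is probably not incomplete and the difference more likely lies in which responses reach the callback.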



