Saturday, April 28, 2018

Scrapy Crawler gets terminated at random pages

I'm new to Scrapy. I'm crawling the r/india subreddit with a recursive parser to store the title, upvotes, and URL of each thread. It works fine for a while, but then the spider terminates unexpectedly with this error:

2018-04-29 00:01:12 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.reddit.com/r/india/?count=50&after=t3_8fh5nv> (referer: https://www.reddit.com/r/india/?count=25&after=t3_8fiqd5)
Traceback (most recent call last):
  File "Z:\Anaconda\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "Z:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "Z:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "Z:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "Z:\Anaconda\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\jayes\myredditscraper\myredditscraper\spiders\scrapereddit.py", line 28, in parse
    yield Request(url=(next_page),callback=self.parse)
  File "Z:\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "Z:\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 62, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url:
2018-04-29 00:01:12 [scrapy.core.engine] INFO: Closing spider (finished)
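For context, this ValueError is raised by scrapy.http.Request itself when it is constructed with a URL that lacks an http:// or https:// scheme. An empty string produces the same message with nothing after the colon, which matches the traceback. A quick standalone check (not part of the spider) that reproduces it:

from scrapy.http import Request

# Request validates its URL at construction time. Any string without a
# scheme, including the empty string, raises the same ValueError as in
# the traceback above.
for url in ('', '/r/india/?count=50'):
    try:
        Request(url=url)
    except ValueError as err:
        print(err)
# Missing scheme in request url:
# Missing scheme in request url: /r/india/?count=50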

The error occurs at a random page each time the spider runs, which makes it hard to pin down the cause. Here's my scrapereddit.py file containing the spider code (I also use a pipeline and items.py, but I don't think the problem is there):

import scrapy
import time
from scrapy.http.request import Request
from myredditscraper.items import MyredditscraperItem


class ScraperedditSpider(scrapy.Spider):
    name = 'scrapereddit'
    allowed_domains = ['www.reddit.com']
    start_urls = ['http://www.reddit.com/r/india/']

    def parse(self, response):
        next_page = ''
        titles = response.css("a.title::text").extract()
        links = response.css("a.title::attr(href)").extract()
        votes = response.css("div.score.unvoted::attr(title)").extract()
        for item in zip(titles, links, votes):
            new_item = MyredditscraperItem()
            new_item['title'] = item[0]
            new_item['link'] = item[1]
            new_item['vote'] = item[2]
            yield new_item

            next_page = response.css("span.next-button").css('a::attr(href)').extract()[0]

        if next_page is not None:
            yield Request(url=next_page, callback=self.parse)
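One plausible cause, reading the code above: the next_page assignment sits inside the for loop, so on any response where the title/link/vote selectors match nothing, the loop body never runs, next_page keeps its initial value '', the `if next_page is not None` check still passes (an empty string is not None), and Request('') fails with exactly the "Missing scheme" error in the traceback. A sketch of a guarded parse, assuming the same selectors and item fields (extract_first() and response.urljoin are standard Scrapy APIs):

import scrapy
from myredditscraper.items import MyredditscraperItem


class ScraperedditSpider(scrapy.Spider):
    name = 'scrapereddit'
    allowed_domains = ['www.reddit.com']
    start_urls = ['http://www.reddit.com/r/india/']

    def parse(self, response):
        titles = response.css("a.title::text").extract()
        links = response.css("a.title::attr(href)").extract()
        votes = response.css("div.score.unvoted::attr(title)").extract()
        for title, link, vote in zip(titles, links, votes):
            new_item = MyredditscraperItem()
            new_item['title'] = title
            new_item['link'] = link
            new_item['vote'] = vote
            yield new_item

        # Look up the pagination link once, after the loop. extract_first()
        # returns None instead of raising IndexError when nothing matches,
        # so the guard below really does protect the Request.
        next_page = response.css("span.next-button a::attr(href)").extract_first()
        if next_page:
            # urljoin turns a relative href into an absolute URL, so the
            # Request always carries a scheme.
            yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse)

Note the truthiness check (`if next_page:`) rejects both None and the empty string, which the original `is not None` test does not.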



