mardi 27 octobre 2020

Python Scrapy text() function unable to get empty td element

I'm very, VERY new to web scraping and I'm still learning as I go. Currently, I'm using Python and Scrapy to build my own web scraper but I encountered something really odd.

I tried to go to scrape this webpage right here just as an exercise: https://worldpopulationreview.com/countries/countries-by-national-debt

That's basically a webpage which lists the Debt to GDP ratio for various countries in the world. Now, if you noticed, Sudan does not have any Population number recorded in the table on that web page.

Empty Population for Sudan

I tried to scrape the population for each country from that web page using this xpath expression:

import scrapy
import pandas as pd


class GdpDebtSpider(scrapy.Spider):
    name = 'gdp_debt'
    allowed_domains = ['worldpopulationreview.com']
    start_urls = ['https://worldpopulationreview.com/countries/countries-by-national-debt/']

    def parse(self, response):

        populations = response.xpath("//tbody/tr/td[3]/text()").getall()

The problem here is that it seems like with the xpath expression above which is

"//tbody/tr/td[3]/text()"

it's unable to capture the empty population table cell in Sudan, it basically skips the population of Sudan entirely because I believe the td element does not contain any text node.

Is there any solution to this where we can extract elements without any text node as an empty string like this: '' instead of skipping it entirely?

Thanks so much everyone!




Aucun commentaire:

Enregistrer un commentaire