Thursday, 10 September 2020

Scrapy Python Web Scraping JSON

I am struggling to figure out how to scrape a JSON response using Scrapy in Python. I was able to scrape JSON successfully from a different page on the same site. I would appreciate any help.

How would I scrape the values in "tournamentGroup" (i.e. id, name) as well as year, title, etc.?

Partial Code:

import json
import urllib.request

start_url = 'https://api.wtatennis.com/tennis/tournaments/?page=0&pageSize=100&excludeLevels=ITF&from=2020-09-01&to=2020-09-30'

# cur and conn are an already-open database cursor and connection (setup omitted)
with urllib.request.urlopen(start_url) as response:
    rank_list = json.loads(response.read())

# This loop is the part that fails: rank_list is a dict, so iterating over it
# yields its key strings, and item['content'] then raises a TypeError.
for item in rank_list:
    tourney_id = item['content']['id']
    tourney_year = item['year']

    cur.execute("""insert into wta_rankings(tourney_id, tourney_year)
                values(%s, %s)
                ON CONFLICT DO NOTHING""",
                (tourney_id, tourney_year))
    conn.commit()
cur.close()

JSON:
{
   "pageInfo":{
      "page":0,
      "numPages":0,
      "pageSize":100,
      "numEntries":10
   },
   "content":[
      {
         "tournamentGroup":{
            "id":2023,
            "name":"Prague 125K",
            "level":"125K",
            "metadata":null
         },
         "year":2020,
         "title":"Prague Open",
         "startDate":"2020-08-29",
         "endDate":"2020-09-06",
         "surface":"Clay",
         "inOutdoor":"O",
         "city":"PRAGUE",
         "country":"Czech Republic",
         "singlesDrawSize":128,
         "doublesDrawSize":32,
         "prizeMoney":3125000,
         "prizeMoneyCurrency":"USD",
         "liveScoringId":"2023"
      },

URL Example: https://api.wtatennis.com/tennis/tournaments/?page=0&pageSize=100&excludeLevels=ITF&from=2020-09-01&to=2020-09-30
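
One possible way to reach those fields, sketched against the sample JSON above rather than the live API: the tournaments appear to sit in the list under the top-level "content" key, and "tournamentGroup" is a nested object inside each entry. The helper name `extract_tournaments`, the output keys, and the trimmed `SAMPLE` string are assumptions made for illustration; they are not part of the original code.

```python
import json

# Trimmed copy of the sample response from the question (one tournament only).
SAMPLE = """
{
  "pageInfo": {"page": 0, "numPages": 0, "pageSize": 100, "numEntries": 10},
  "content": [
    {
      "tournamentGroup": {"id": 2023, "name": "Prague 125K", "level": "125K", "metadata": null},
      "year": 2020,
      "title": "Prague Open"
    }
  ]
}
"""


def extract_tournaments(payload):
    """Hypothetical helper: flatten each entry of payload['content'] into a dict."""
    rows = []
    # Loop over the list stored under "content", not over the top-level dict.
    for item in payload.get("content", []):
        # "tournamentGroup" is a nested object; fall back to {} if it is missing or null.
        group = item.get("tournamentGroup") or {}
        rows.append({
            "tourney_id": group.get("id"),
            "tourney_name": group.get("name"),
            "tourney_year": item.get("year"),
            "title": item.get("title"),
        })
    return rows


rows = extract_tournaments(json.loads(SAMPLE))
print(rows)
```

In a Scrapy callback, the same helper could probably be fed `json.loads(response.text)`, and each row could then be yielded as an item or passed to `cur.execute` as in the partial code.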



